Self-training of question answering system using question profiles

ABSTRACT

Mechanisms are provided to implement a self-training engine of a question and answer system. The self-training engine pairs an unanswered natural language question with portions of an electronic document to generate an unlabeled data set. The self-training engine trains a model based on a labeled data set comprising a finite number of question and answer pair data structures and applies the model to the unlabeled data set to identify an answer from the portions of the electronic document to the unanswered natural language question. The self-training engine modifies the labeled data set to include the identified answer and corresponding unanswered natural language question as an additional question and answer pair data structure. The self-training engine then trains the model based on the modified labeled data set.

BACKGROUND

The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for providing a self-training functionality for a question answering system using question profiles.

Decision-support systems exist in many different industries where human experts require assistance in retrieving and analyzing information. An example that will be used throughout this application is a diagnosis system employed in the healthcare industry. Diagnosis systems can be classified into systems that use structured knowledge, systems that use unstructured knowledge, and systems that use clinical decision formulas, rules, trees, or algorithms. The earliest diagnosis systems used structured knowledge or classical, manually constructed knowledge bases. The Internist-I system developed in the 1970s uses disease-finding relations and disease-disease relations. The MYCIN system for diagnosing infectious diseases, also developed in the 1970s, uses structured knowledge in the form of production rules, stating that if certain facts are true, then one can conclude certain other facts with a given certainty factor. DXplain, developed starting in the 1980s, uses structured knowledge similar to that of Internist-I, but adds a hierarchical lexicon of findings.

Iliad, developed starting in the 1990s, adds more sophisticated probabilistic reasoning where each disease has an associated a priori probability of the disease (in the population for which Iliad was designed), and a list of findings along with the fraction of patients with the disease who have the finding (sensitivity), and the fraction of patients without the disease who have the finding (1-specificity).
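By way of a non-limiting illustration, the three quantities named above (a priori probability, sensitivity, and 1-specificity) are the inputs of Bayes' rule for a single finding. The following is a minimal sketch under that assumption only; it is not a description of how Iliad itself combines findings, and the function name and the numeric values are hypothetical.

```python
def posterior_given_finding(prior, sensitivity, false_positive_rate):
    """P(disease | finding present), computed with Bayes' rule.

    prior               -- a priori probability of the disease
    sensitivity         -- fraction of patients WITH the disease who have the finding
    false_positive_rate -- fraction of patients WITHOUT the disease who have the
                           finding, i.e. (1 - specificity)
    """
    with_disease = prior * sensitivity
    without_disease = (1.0 - prior) * false_positive_rate
    return with_disease / (with_disease + without_disease)


# Hypothetical values chosen only for illustration.
p = posterior_given_finding(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"posterior probability: {p:.3f}")
```

With a rare disease, even a sensitive finding leaves the posterior probability modest, which is why such systems weigh many findings rather than one.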

In 2000, diagnosis systems using unstructured knowledge started to appear. These systems use some structuring of knowledge, for example, entities such as findings and disorders being tagged in documents to facilitate retrieval. ISABEL, for example, uses Autonomy information retrieval software and a database of medical textbooks to retrieve appropriate diagnoses given input findings. Autonomy Auminence uses the Autonomy technology to retrieve diagnoses given findings and organizes the diagnoses by body system. First CONSULT allows one to search a large collection of medical books, journals, and guidelines by chief complaints and age group to arrive at possible diagnoses. PEPID DDX is a diagnosis generator based on PEPID's independent clinical content.

Clinical decision rules have been developed for a number of medical disorders, and computer systems have been developed to help practitioners and patients apply these rules. The Acute Cardiac Ischemia Time-Insensitive Predictive Instrument (ACI-TIPI) takes clinical and ECG features as input and produces probability of acute cardiac ischemia as output to assist with triage of patients with chest pain or other symptoms suggestive of acute cardiac ischemia. ACI-TIPI is incorporated into many commercial heart monitors/defibrillators. The CaseWalker system uses a four-item questionnaire to diagnose major depressive disorder. The PKC Advisor provides guidance on 98 patient problems such as abdominal pain and vomiting.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described herein in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In one illustrative embodiment, a method is provided, in a data processing system comprising a processor and a memory, the memory comprising instructions which cause the processor to be configured to implement a self-training engine of a question and answer system. The method comprises pairing, by the self-training engine, an unanswered natural language question with portions of an electronic document to generate an unlabeled data set. The method further comprises training, by the self-training engine, a model based on a labeled data set comprising a finite number of question and answer pair data structures. Moreover, the method comprises applying, by the self-training engine, the model to the unlabeled data set to identify an answer from the portions of the electronic document to the unanswered natural language question. In addition, the method comprises modifying, by the self-training engine, the labeled data set to include the identified answer and corresponding unanswered natural language question as an additional question and answer pair data structure. Furthermore, the method comprises training, by the self-training engine, the model based on the modified labeled data set.

In other illustrative embodiments, a computer program product comprising a computer useable or readable medium having a computer readable program is provided. The computer readable program, when executed on a computing device, causes the computing device to perform various ones of, and combinations of, the operations outlined above with regard to the method illustrative embodiment.

In yet another illustrative embodiment, a system/apparatus is provided. The system/apparatus may comprise one or more processors and a memory coupled to the one or more processors. The memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones of, and combinations of, the operations outlined above with regard to the method illustrative embodiment.

These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the example embodiments of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts an overview of this self-training functionality using question profiles in accordance with one illustrative embodiment;

FIGS. 2A and 2B illustrate examples of question profiles in accordance with one illustrative embodiment;

FIG. 3 depicts a schematic diagram of one illustrative embodiment of a cognitive system in a computer network, in which a question profile based self-training engine is implemented to train one or more models utilized by the cognitive system;

FIG. 4 is a block diagram of an example data processing system in which aspects of the illustrative embodiments are implemented;

FIG. 5 is an example diagram illustrating an interaction of elements of a healthcare cognitive system in accordance with one illustrative embodiment; and

FIG. 6 is a flowchart outlining an example operation for performing an iterative question profile based self-training operation in accordance with one illustrative embodiment.

DETAILED DESCRIPTION

The strengths of current cognitive systems, such as current medical diagnosis, patient health management, patient treatment recommendation systems, law enforcement investigation systems, and other decision support systems, are that they can provide insights that improve the decision making performed by human beings. For example, in the medical context, such cognitive systems may improve medical practitioners' diagnostic hypotheses, can help medical practitioners avoid missing important diagnoses, and can assist medical practitioners with determining appropriate treatments for specific diseases. However, current systems still suffer from significant drawbacks which should be addressed in order to make such systems more accurate and usable for a variety of applications as well as more representative of the way in which human beings make decisions, such as diagnosing and treating patients. In particular, one drawback of current systems is the considerable amount of human effort and resource expenditure needed to train such cognitive systems to perform their cognitive operations.

Cognitive systems, such as natural language based Question Answering (QA) based cognitive systems, are trained using machine learning technology that requires text that is annotated by human subject matter experts (SMEs). In order to successfully train a machine-learning model of a cognitive system, e.g., QA based cognitive system, the training set, e.g., a set of training questions and corresponding known (ground-truth) answers, used to perform the training needs to be realistic and representative of requests/questions that would be submitted/asked by a user of the cognitive system. For purposes of the following discussion, it will be assumed that the cognitive system is a QA system but it should be appreciated that any cognitive system that processes requests/questions may be used without departing from the spirit and scope of the illustrative embodiments. The set of training questions and the corresponding answers are used to train the QA based cognitive system by modifying operational parameters of the QA based cognitive system so that it generates the correct given answer for the training question.

In the medical domain, for a patient-specific QA based cognitive system (hereafter referred to as a QA system), the requirement to have questions that are representative of actual questions that would be asked by the patient, or the medical professional treating the patient, makes it especially challenging and expensive to obtain a set of such training question-answer (QA) pairs. While it is challenging to automatically gather questions that a medical professional, e.g., the patient's physician, would ask about patients during the course of the medical professional's day, manually identifying correct answers to these questions (words, phrases or passages in the patient's medical record) is an even more time-consuming and resource-intensive task.

Thus, overall, the training of a QA system takes considerable human effort and time, especially in the medical domain, although similar levels of human effort and time may be present in other domains as well. In order to ensure that all aspects of a patient's electronic medical records (EMRs) are covered by the QA system, a large set of training questions and ground truth answers (typically tens of thousands of question-answer (QA) pairs) must be generated and used to train the QA system. This takes considerable time and effort to accomplish. Through studies it has been noted that the process of ground truth answer generation for training a medical domain QA system from a patient's EMRs may take approximately a year of human effort to generate ground truth answers for approximately 1750 training questions with a team of 11 SMEs (medical expert annotators) working on the task intermittently. In comparison, the question generation aspect of generating QA pairs for training the QA system is relatively faster, with a single SME being able to generate approximately 15 questions per hour. This results in a relatively small training set of QA pairs and a larger set of unanswered questions. The challenge is to expand the relatively small labeled training set of QA pairs effectively without having to rely on a human based annotation effort. It should be appreciated that a “labeled” training set refers to a set in which each QA pair has an associated label indicating whether or not the “answer” of the QA pair answers the corresponding “question” of the QA pair. The label is often a binary 1 or 0 value where 1 indicates that the “answer” answers the “question” and a 0 indicates that the “answer” does not answer the “question” in the QA pair.

Self-training is a method that has been successfully used to address limited training data on many natural language processing (NLP) tasks, such as parsing, information extraction, word sense disambiguation, and the like. However, the self-training methodology has not been applied directly to the QA pair generation task for self-training a QA system, and especially not with regard to the many challenges specific to the medical domain and handling questions directed to, or answerable by, patient EMRs. Even in other domains, self-training has not been applied to the QA pair generation problem in the manner set forth hereafter with regard to the illustrative embodiments described herein.

The illustrative embodiments provide mechanisms for performing self-training of a QA system by pairing unanswered questions with portions of documentation to generate new QA pairs used for training the QA system. In the following description of the illustrative embodiments, various terms will be utilized having the following definitions specific to the illustrative embodiments:

1. Labeled Set (or Labeled Training Set): Questions that have been associated with correct answers. An example <question, answer> pair in the labeled training set is as follows:

<What medications does the patient take for diabetes?, a regular dose of insulin>

The label corresponding to this instance would be 1, thereby indicating that the answer correctly answers the question.

2. Unlabeled Set: Questions that have not yet been associated with correct answers and thus, do not have an associated label indicating that an answer is correct or not correct for the question.

3. Self-Training: A semi-supervised learning method for utilizing unlabeled data with labeled data to create a better machine learning model.

4. Question Profile: Attributes of the question and expected answer.

5. QA Pairs: A natural language question paired with a correct answer.

6. Answer: A fact, one or more passages of natural language text, or one or more documents of natural language text, which is responsive to a natural language question.

7. QA Data Set/QA Training Set: A set of natural language questions along with their correct answers.

8. Corpus/Corpora: A collection of text documents covering information from a wide variety of topics and sources, e.g., a clinical corpus may contain electronic medical records for different patients potentially generated by, or otherwise available from, various sources.

9. Electronic Medical Records (EMRs): A collection of medical information for a patient stored in electronic form as one or more data structures. The medical information may be obtained from one or more sources of information and may represent various types of information generally directed to the medical condition of the patient at one or more points in time.

10. Clinical Notes: Portions of an EMR that represent results, observations, diagnosis, treatment instructions, or the like, generated by one or more medical personnel, medical equipment, or the like, and added to the EMR as a result of an interaction with a patient for the purposes of treating the patient. Clinical notes may be provided as natural language textual notes added to the EMR or may be provided in a more structured format. In some cases, clinical notes may be added to the patient's EMRs automatically by automated medical equipment.
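For illustration only, the labeled set, the unlabeled set, and the binary label defined above may be represented as follows. This is a minimal sketch; the class name `QAPair`, its fields, and the helper `split_by_label` are hypothetical and are not part of the illustrative embodiments. The unlabeled answer text in the usage lines is likewise invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QAPair:
    """A question paired with a candidate answer (hypothetical structure)."""
    question: str
    answer: str                   # e.g. a phrase or the text of a clinical note
    label: Optional[int] = None   # 1 = answers the question, 0 = does not,
                                  # None = not yet labeled


def split_by_label(pairs: List[QAPair]) -> Tuple[List[QAPair], List[QAPair]]:
    """Separate pairs into the labeled set and the unlabeled set."""
    labeled = [p for p in pairs if p.label is not None]
    unlabeled = [p for p in pairs if p.label is None]
    return labeled, unlabeled


pairs = [
    # The labeled example given in definition 1.
    QAPair("What medications does the patient take for diabetes?",
           "a regular dose of insulin", label=1),
    # An unanswered question paired with a candidate note (invented text).
    QAPair("What medications does the patient take for hypertension?",
           "text of a clinical note from the patient's EMR"),
]
labeled_set, unlabeled_set = split_by_label(pairs)
print(len(labeled_set), len(unlabeled_set))
```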

With these definitions in mind, in the context of a medical domain based QA system, the illustrative embodiments generate QA pairs and perform self-training of a QA system using such QA pairs by pairing unanswered questions with clinical notes from a patient's electronic medical record (EMR). It should be appreciated that while medical domain based QA system self-training will be used as a basis for describing the operation of the illustrative embodiments, the present invention is not limited to this domain or such embodiments. To the contrary, the mechanisms of the illustrative embodiments may be utilized in any domain depending on the desired implementation, e.g., legal domain, financial domain, law enforcement domain, or any other domain where self-training of a QA system is desirable to avoid extensive expenditure of human resources and time to train the QA system.

Again, assuming a medical domain based QA system implementation, unanswered questions are paired with all of the clinical notes of the corresponding patient's EMR to form an unlabeled set of questions and answers. The self-training process trains a model on the limited labeled data, applies the semi-trained model on the unlabeled set, and selects one or more question-answer (QA) pairs for which corresponding confidence scores are sufficiently high, e.g., equal to or greater than a predetermined threshold. These QA pairs are then added to the labeled set of QA pairs, where the labeled set of QA pairs is the training set used to train the model of the QA system. This process is iteratively repeated, incrementally expanding the labeled set of QA pairs until convergence, e.g., no more QA pairs having sufficiently high confidence are identified.
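The iterative process just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the illustrative embodiments: `self_train` and `train_overlap_model` are hypothetical names, the fixed threshold of 0.75 is arbitrary, and the word-overlap scorer is a toy stand-in for a real machine learning model (it ignores the question entirely). The unanswered question and note texts are invented.

```python
def train_overlap_model(labeled):
    """Toy stand-in for model training: remember words of answers labeled 1."""
    known = set()
    for _question, answer, label in labeled:
        if label == 1:
            known.update(answer.lower().split())

    def score(_question, answer):
        # Confidence = fraction of the candidate's words seen in known answers.
        words = answer.lower().split()
        return sum(w in known for w in words) / len(words) if words else 0.0

    return score


def self_train(labeled, unlabeled, train, threshold=0.75, max_iters=10):
    """Expand the labeled set with confident pairs, retraining each round."""
    labeled = list(labeled)   # (question, answer, label) triples
    pool = list(unlabeled)    # (question, answer) pairs without labels
    model = train(labeled)
    for _ in range(max_iters):
        scored = [(model(q, a), q, a) for q, a in pool]
        confident = [(q, a) for s, q, a in scored if s >= threshold]
        if not confident:
            break             # convergence: nothing scores high enough
        labeled.extend((q, a, 1) for q, a in confident)
        pool = [(q, a) for s, q, a in scored if s < threshold]
        model = train(labeled)  # retrain on the expanded labeled set
    return model, labeled


seed = [("What medications does the patient take for diabetes?",
         "a regular dose of insulin", 1)]
question = "What does the patient take for blood sugar?"  # invented question
candidates = [
    (question, "regular insulin dose adjusted"),
    (question, "insulin dose adjusted weekly"),
    (question, "patient reports knee pain"),
]
model, final_labeled = self_train(seed, candidates, train_overlap_model)
print(len(final_labeled))
```

In this toy run the second candidate only clears the threshold after the first has been added and the model retrained, which is the incremental expansion the paragraph describes; the unrelated note is never added.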

This iterative process for generating the training set of QA pairs automatically based on a limited size labeled set, to thereby iteratively expand the training set, and then utilizing the expanded training set of QA pairs to train the QA system, improves the QA system performance by providing a larger training set for training the QA system which captures more of the possible questions and corresponding answers that actual patients/medical professionals would ask the QA system during runtime and tuning the operation of the QA system to accommodate such. In addition, the illustrative embodiments further improve upon this process by incorporating question profiles into the process of selecting the QA pairs from the unlabeled set. The question profiles provide a mechanism for selecting high confidence scoring QA pairs (where the score is greater than a specific threshold) from the unlabeled set of QA pairs to be included in the labeled set of QA pairs. The profile for a question, or the “question profile,” comprises the characteristics of the question and correct answer to the question, including, for example, key entities, relations between key entities, answer type, whether the answer is likely to occur across multiple portions of documents, e.g., notes in a patient's EMR, or a single portion, e.g., a single note in the patient EMR, and other pertinent characteristics. The question profile also stores information on all other questions in the question set (labeled and unlabeled questions) that are “similar” to this question, where similarity implies that both questions are essentially asking for the same answer, which may be determined from features extracted from the questions including focus, lexical answer type, and the like. Moreover, additional features indicative of similarity may also be utilized with logic specifically configured for identifying such instances of features and marking them as similarities between questions, e.g., temporally similar questions that map to identical time periods, questions that typically map to the same clinical note type around the same time period, questions where the answer is found in all clinical notes of the same clinical note type or same clinical note section, etc.

All of this information from a question's profile can be used to estimate the following: a) types of clinical notes where the answer is likely to be found, b) number of answer clinical notes corresponding to a question, and c) clusters of “similar” questions. This information can be used to influence the self-training process by setting a dynamic threshold based on the question's profile for selecting QA pairs from the unlabeled set, e.g., the dynamic threshold may be based on the type of clinical notes where the answer is likely to be found, the number of answer clinical notes, and/or thresholds associated with other similar questions in a cluster. This allows for considerable flexibility by allowing multiple clinical notes to be considered as an answer for certain question types. Further, these mechanisms can also be used for “label propagation”, i.e., propagating labels or values indicating that an answer to a question actually answers the question or not, among questions that are from the same cluster when applying the trained model to the QA pairs in the unlabeled set.
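One possible way to realize a profile-driven dynamic threshold and label propagation is sketched below. Everything here is an assumption made for illustration: the `QuestionProfile` fields, the rule of lowering the threshold by a fixed step, the floor value, and the choice to share selected notes across all questions in a cluster are hypothetical and are not specified by the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class QuestionProfile:
    """Hypothetical subset of the profile attributes named in the text."""
    likely_note_types: Set[str]   # a) note types where the answer is likely
    expected_answer_notes: int    # b) number of answer clinical notes
    cluster_id: str               # c) cluster of "similar" questions
    base_threshold: float = 0.8


def dynamic_threshold(profile, note_type, step=0.05, floor=0.5):
    """Lower the base threshold when the profile makes an answer more plausible."""
    threshold = profile.base_threshold
    if note_type in profile.likely_note_types:
        threshold -= step   # answer is expected in this kind of note
    if profile.expected_answer_notes > 1:
        threshold -= step   # answer is expected to span several notes
    return max(floor, threshold)


def select_answers(scored_notes: List[Tuple[str, str, float]], profile):
    """Keep notes whose confidence meets their dynamic threshold,
    best first, up to the number of answer notes the profile expects."""
    passing = [(note_id, conf) for note_id, note_type, conf in scored_notes
               if conf >= dynamic_threshold(profile, note_type)]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [note_id for note_id, _ in passing[:profile.expected_answer_notes]]


def propagate_labels(selected: Dict[str, List[str]],
                     clusters: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give every question in a cluster the notes selected for any member."""
    propagated = {}
    for members in clusters.values():
        shared: List[str] = []
        for question in members:
            for note_id in selected.get(question, []):
                if note_id not in shared:
                    shared.append(note_id)
        for question in members:
            propagated[question] = list(shared)
    return propagated


profile = QuestionProfile({"medication list"}, expected_answer_notes=2,
                          cluster_id="diabetes-meds")
scored = [("n1", "medication list", 0.74), ("n2", "progress note", 0.74),
          ("n3", "medication list", 0.90), ("n4", "progress note", 0.78)]
chosen = select_answers(scored, profile)
shared = propagate_labels({"q1": chosen}, {"diabetes-meds": ["q1", "q2"]})
print(chosen, shared["q2"])
```

Capping the selection at the expected number of answer notes is one design choice among several; an implementation could equally accept every note that meets its threshold.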

Thus, the utilization of question profiles takes into account the longitudinal nature of the EMR data by allowing the system to choose a dynamic threshold and thus find the answer across multiple clinical notes. The implementation of question profiles also helps improve the overall accuracy of the system using label propagation. As noted above, the use of such question profiles is not limited to QA systems operating in the medical domain. The mechanisms of the illustrative embodiments may be implemented and applied to QA over any longitudinal dataset, i.e., a dataset with related documents generated over time, e.g., biographical articles, law documents, reports about evolving news events, financial documentation, law enforcement documentation, and/or the like.

Before beginning the discussion of the various aspects of the illustrative embodiments in more detail, it should first be appreciated that throughout this description the term “mechanism” will be used to refer to elements of the present invention that perform various operations, functions, and the like. A “mechanism,” as the term is used herein, may be an implementation of the functions or aspects of the illustrative embodiments in the form of an apparatus, a procedure, or a computer program product. In the case of a procedure, the procedure is implemented by one or more devices, apparatus, computers, data processing systems, or the like. In the case of a computer program product, the logic represented by computer code or instructions embodied in or on the computer program product is executed by one or more hardware devices in order to implement the functionality or perform the operations associated with the specific “mechanism.” Thus, the mechanisms described herein may be implemented as specialized hardware, software executing on general purpose hardware, software instructions stored on a medium such that the instructions are readily executable by specialized or general purpose hardware, a procedure or method for executing the functions, or a combination of any of the above.

The present description and claims may make use of the terms “a”, “at least one of”, and “one or more of” with regard to particular features and elements of the illustrative embodiments. It should be appreciated that these terms and phrases are intended to state that there is at least one of the particular feature or element present in the particular illustrative embodiment, but that more than one can also be present. That is, these terms/phrases are not intended to limit the description or claims to a single feature/element being present or require that a plurality of such features/elements be present. To the contrary, these terms/phrases only require at least a single feature/element with the possibility of a plurality of such features/elements being within the scope of the description and claims.

Moreover, it should be appreciated that the use of the term “engine,” if used herein with regard to describing embodiments and features of the invention, is not intended to be limiting of any particular implementation for accomplishing and/or performing the actions, steps, processes, etc., attributable to and/or performed by the engine. An engine may be, but is not limited to, software, hardware and/or firmware or any combination thereof that performs the specified functions including, but not limited to, any use of a general and/or specialized processor in combination with appropriate software loaded or stored in a machine readable memory and executed by the processor. Further, any name associated with a particular engine is, unless otherwise specified, for purposes of convenience of reference and not intended to be limiting to a specific implementation. Additionally, any functionality attributed to an engine may be equally performed by multiple engines, incorporated into and/or combined with the functionality of another engine of the same or different type, or distributed across one or more engines of various configurations.

In addition, it should be appreciated that the following description uses a plurality of various examples for various elements of the illustrative embodiments to further illustrate example implementations of the illustrative embodiments and to aid in the understanding of the mechanisms of the illustrative embodiments. These examples are intended to be non-limiting and are not exhaustive of the various possibilities for implementing the mechanisms of the illustrative embodiments. It will be apparent to those of ordinary skill in the art in view of the present description that there are many other alternative implementations for these various elements that may be utilized in addition to, or in replacement of, the examples provided herein without departing from the spirit and scope of the present invention.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

As noted above, the present invention provides mechanisms for implementing a self-training functionality for a question answering system using question profiles. An overview of this self-training functionality using question profiles, in accordance with one illustrative embodiment, is depicted in FIG. 1. As shown in FIG. 1, the training set size is iteratively increased by automatically assigning labels to an unlabeled set (U) of question and answer pair candidates 120 using a machine learning (ML) model 140 trained in a previous iteration of the self-training process. The question profile 130 associated with each unanswered question in the unlabeled set U 120, as determined from an association of the unanswered question to an answered question in the labeled training set (L) 110, is used to influence the threshold and the number of question-answer (e.g., clinical note) pairs, through process 150, that are added from the unlabeled set U 120 to the labeled training set (L) 110, which is then used to train the machine learning model 140 for the next iteration.

Thus, given a small set of labeled QA pairs (L) 110 that can be considered a labeled training set for the machine learning (ML) model, a larger set of unanswered questions paired with each clinical note in the corresponding EMR, i.e. unlabeled set (U) 120, and question profiles 130 associated with each question in the labeled training set, the detailed operations of the self-training process shown in FIG. 1 are as follows:

1. Train the machine learning model 140: A model 140 is created using the labeled training set L 110.

2. Apply the trained model 140 to the unlabeled set U 120: The model 140 is applied to the unlabeled set U 120 of unanswered questions that are paired with all clinical notes of a patient EMR to generate confidence scores associated with each of these pairings, where the confidence scores indicate a confidence of the model 140 that the answer of a QA pair is a correct answer for the unanswered question of the QA pair. This generates, for each unanswered question, a set of answers and confidence scores associated with each of the answers, where the answers are clinical notes in the EMR.

3. For each question in the unlabeled set U 120, identify a question profile: For each question in the unlabeled set U 120, a corresponding question profile is identified from the question profile set 130 that corresponds to the labeled training set L 110. That is, a similarity evaluation is performed for determining a similarity between the unanswered question in the unlabeled set U 120 and a question in the labeled training set 110. A similarity score based on a correlation of features may be generated and a highest scoring similar question in the labeled training set L 110 is identified for the unanswered question in the unlabeled set U 120. The selected similar question in the labeled training set L 110 has a corresponding question profile which is then applied to the unanswered question in the unlabeled set U 120 to identify a corresponding confidence score threshold value and/or a number of QA pairs to select for inclusion in the labeled training set L 110 for the unanswered question in the unlabeled set U 120.

4. For each question in the unlabeled set U 120, a dynamic threshold for selecting QA pairs based on their confidence scores is determined: In general, QA pairs are selected for each question based on the confidence score meeting or exceeding a threshold confidence score and thus, indicating a high confidence. Here, “high” confidence is determined by a threshold value, where typically, QA pairs meeting or exceeding a threshold confidence score value are considered to have a “high” confidence and can be added to the labeled training set L 110. The threshold value determines the confidence of the model 140 for a QA pair in the unlabeled set U 120 being classified as positive, i.e. that the answer is a correct answer for the question. With the illustrative embodiments, instead of simply using a “fixed” threshold, a “dynamic” threshold may be used based on a question's corresponding question profile.

In order to determine the threshold, information from the question profile for the question is used to determine what threshold confidence score value to utilize and/or the number of QA pairs to select. Characteristics of the question and its answer may be identified in the profile that indicate various criteria for determining a confidence score threshold/number of QA pairs to select. For example, if a question profile indicates that the answer is expected to be “evenly distributed across the EMR, with a cardinality of approximately 20 clinical notes”, this information may indicate that the top 20 QA pairs are to be added from the unlabeled set U 120 to the labeled training set L 110. Similarly, if the question profile indicates that a particular question type typically has only 1 clinical note based answer associated with it, only the top scoring QA pair from the unlabeled set U 120 is added to the labeled training set L 110. If the question profile for the question lacks this information, or if a sufficiently similar question is not found such that a corresponding question profile is not associated with the question, then a default operation is implemented in which a fixed threshold confidence score value and/or a corresponding default number of QA pairs is used to add QA pairs from the unlabeled set U 120 to the labeled training set L 110, where the value is tuned on a separate development set 160 (a small set of labeled data used for parameter tuning). In machine learning, a development set is a labeled dataset that is independent of the training set. This development set is used to tune parameters for the learning process, including threshold values.

5. Select QA pairs to add to labeled training set L 110: Based on the identified dynamic threshold or default threshold, and the confidence scores associated with the QA pairs in the unlabeled set U 120 for the question, QA pairs that have confidence scores meeting or exceeding the threshold confidence score value are selected, via process 150. The number of such QA pairs that are selected may be based on the dynamic determination corresponding to the question profile.

6. Label propagation: Labels are propagated to other questions belonging to a same cluster as indicated by characteristics in the question profile 130 associated with the question. That is, a clustering operation may be performed on questions in the unlabeled set U 120 based on characteristics in the question profiles associated with those questions, or based on features extracted from those questions. For instance, a question cluster may contain a set of questions with a high semantic similarity score, i.e. all the questions in the same cluster are essentially asking for the same answer in different ways. One such question may get classified with a particular answer, e.g., clinical note, with high confidence as positive, e.g., a value of 1 indicating the answer is correct for the question. This positive label of the answer may now be assigned to all other questions in the question cluster with regard to this answer, thus increasing label accuracy of the QA pairs added from the unlabeled set U 120 to the labeled training set L 110.

7. Update Trained Model based on Expanded Labeled Training Set L′ 110: The trained machine learning model 140 is updated based on the newly expanded labeled training set L′, which includes more QA pairs than in a previous iteration.

The above process of operations 1-7 may be repeated until all the unanswered questions in the unlabeled set U 120 are answered and corresponding QA pairs are added to the labeled training set L 110, or the trained machine learning model 140 converges, i.e. the changes to the operational parameters of the model 140 do not exceed a predetermined threshold value.
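By way of illustration only, the iterative structure of operations 1-7 may be sketched as follows. This is a simplified sketch under stated assumptions and is not the implementation of the illustrative embodiments: the function name `self_train`, the `train` and `score` callables, the fixed threshold, and the toy word-overlap model are all hypothetical, question profiles and label propagation are omitted, and "no new QA pairs were added" is used as a stand-in for the convergence test on model parameters.

```python
def self_train(labeled, unlabeled, train, score, threshold=0.8, max_iters=10):
    """Iteratively grow the labeled set L from the unlabeled set U.

    labeled:   list of (question, note) pairs treated as correct answers
    unlabeled: dict mapping each unanswered question to its candidate notes
    train:     callable(labeled_pairs) -> model            (hypothetical)
    score:     callable(model, question, note) -> float    (hypothetical)
    """
    labeled = list(labeled)
    remaining = {q: list(notes) for q, notes in unlabeled.items()}
    model = train(labeled)                                   # operation 1
    for _ in range(max_iters):
        added = []
        for question, notes in remaining.items():            # operation 2
            for note in notes:
                # Operations 4-5, simplified here to a fixed threshold.
                if score(model, question, note) >= threshold:
                    added.append((question, note))
        if not added:
            break            # stand-in for convergence: nothing new to learn
        labeled.extend(added)
        for question, _ in added:
            remaining.pop(question, None)
        model = train(labeled)                                # operation 7
        if not remaining:
            break            # every unanswered question now has an answer
    return model, labeled


# Toy model for demonstration only: the "model" is the vocabulary of the
# labeled notes, and the score is the fraction of note words it contains.
def toy_train(pairs):
    return {word for _, note in pairs for word in note.lower().split()}


def toy_score(model, question, note):
    words = note.lower().split()
    return sum(word in model for word in words) / len(words)


seed = [("q0", "lipid panel normal")]
candidates = {
    "q1": ["lipid panel normal today", "x-ray of forearm"],
    "q2": ["panel today"],
}
final_model, final_labeled = self_train(
    seed, candidates, toy_train, toy_score, threshold=0.75
)
```

In this toy run, the pair for `q2` does not meet the threshold in the first iteration and is only selected after the model has been retrained on the pair added for `q1`, which is the behavior the iterative expansion is intended to produce.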

As noted above, the illustrative embodiments provide a self-training functionality that is based, at least in part, on question profiles 130 associated with questions in the labeled training set L 110. The question profiles 130 may be generated manually by a subject matter expert (SME). The question profiles, or at least portions of the question profiles, may be generated automatically using automated tools that may implement a machine learning algorithm that analyzes questions and corresponding answers in the labeled training set L 110 to generate features and predict profile characteristics for other questions. Moreover, a rule-based algorithm may be implemented to analyze questions and predict profile characteristics, where the rules may be generated with the assistance of manual intervention by an SME, for example. Any automated, manual, or semi-automated methodology for generating question profiles 130 may be used without departing from the spirit and scope of the present invention.

FIGS. 2A and 2B illustrate examples of question profiles in accordance with one illustrative embodiment. The question profile 220 in FIG. 2A is for a question 210 in the labeled training set L 110 of the type “When was the patient's last lipid panel performed?” The question profile 240 in FIG. 2B is for a question 230 in the labeled training set L 110 of the type “Why were x-rays done on the forearm and hand?” The question profiles may be correlated with other similar questions, such as unanswered questions in the unlabeled set U 120, based on similarity measures between characteristics specified in the question profile and features extracted from the unanswered questions, or between features extracted from the answered questions 210 and 230 and features extracted from the unanswered questions.
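As a non-limiting sketch of how an unanswered question might be correlated with the question profile of its most similar labeled question, the following uses a Jaccard overlap of word tokens as the similarity measure. The choice of Jaccard similarity, the function names, the `min_similarity` cutoff, and the placeholder profile values are assumptions made for illustration; the description above does not prescribe a particular similarity measure.

```python
import re


def tokens(text):
    """Lowercased word tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_profile(question, labeled_profiles, min_similarity=0.3):
    """Return the profile of the most similar labeled question, or None
    when no labeled question is similar enough (default handling applies)."""
    query = tokens(question)
    best_question, best_score = None, 0.0
    for labeled_question in labeled_profiles:
        similarity = jaccard(query, tokens(labeled_question))
        if similarity > best_score:
            best_question, best_score = labeled_question, similarity
    if best_question is None or best_score < min_similarity:
        return None
    return labeled_profiles[best_question]


# Placeholder profile values standing in for profiles 220 and 240.
profiles = {
    "When was the patient's last lipid panel performed?": "profile_2A",
    "Why were x-rays done on the forearm and hand?": "profile_2B",
}
matched = match_profile("When was the last lipid panel done?", profiles)
unmatched = match_profile("What medications is the patient taking?", profiles)
```

Returning `None` when no labeled question is sufficiently similar corresponds to the case in which no question profile is associated with the unanswered question and default values are used instead.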

As shown in FIGS. 2A and 2B, the question profile comprises a question type which indicates the goal of the question, e.g., to obtain a particular fact (fact-finding type), to determine the reason behind a fact (reason type), or the like. The question profile further comprises a question template which indicates the general format of the question with tags, e.g., “<test>”, “<body_parts>”, etc., indicating portions of the template that correspond to key entities. The question profile further comprises an answer type which indicates a classification of the nature of the answer, e.g., “temporal” meaning the answer is a date, time, or other temporal term/phrase, “passage” meaning that the answer is a passage from the EMR, “notes” meaning that the answer is found in clinical notes in the EMR, etc.

The question profile may also comprise one or more key entities and corresponding key entity types and modifiers which indicate the key features extracted from the question and characterize what the question is asking for, i.e. the key features that are indicative of the answer. For example, in FIG. 2A, the key entity is “lipid panel” and the key entity type is “test,” with a key entity modifier of “last”, which is indicative of the answer being a temporal answer for the last lipid panel test. In FIG. 2B, the key entity and entity type of “xrays” and “test”, along with the answer type and question type, indicate that the question is asking for the reason for the x-ray test and that the answer is a passage from clinical notes.

The question profile further comprises a “number of answers” characteristic which indicates whether the answer to the question is to be found in a single clinical note or across multiple clinical notes or passages. In addition, the question profile comprises an answer clinical note type (answer note type) which indicates the type of clinical note in which the answer is to be found for this type of question, e.g., lab reports, structured procedures, progress notes, discharge summaries, etc. These are classifications of note types that may be present in a patient EMR and such notes may be tagged with similar classifications in the patient EMR.

The question profile may further comprise an indication of the distribution in the patient EMR where the answer is likely to be found, e.g., evenly distributed across all recent lab notes, evenly distributed across progress notes and discharge summaries around the time of the lab report that mentions the xray. Moreover, the question profile may comprise a cardinality that indicates the maximum number of passages or clinical notes that likely provide a portion of the answer to the question.
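For purposes of illustration, the question profile fields described above may be represented as a simple data structure such as the following. The class and field names are hypothetical. The key entity values for the lipid panel example follow the description of FIG. 2A, while the template string, number of answers, answer note type, and cardinality shown are assumed values, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KeyEntity:
    text: str                        # e.g., "lipid panel"
    entity_type: str                 # e.g., "test"
    modifier: Optional[str] = None   # e.g., "last"


@dataclass
class QuestionProfile:
    question_type: str               # e.g., "fact-finding" or "reason"
    question_template: str           # general format with tags such as <test>
    answer_type: str                 # e.g., "temporal", "passage", "notes"
    key_entities: List[KeyEntity] = field(default_factory=list)
    number_of_answers: Optional[str] = None   # "single" or "multiple"
    answer_note_type: Optional[str] = None    # e.g., "lab reports"
    distribution: Optional[str] = None        # where answers tend to appear
    cardinality: Optional[int] = None         # max notes holding the answer


lipid_profile = QuestionProfile(
    question_type="fact-finding",
    question_template="When was the patient's last <test> performed?",  # assumed
    answer_type="temporal",
    key_entities=[KeyEntity("lipid panel", "test", "last")],
    number_of_answers="single",       # assumed value
    answer_note_type="lab reports",   # assumed value
    cardinality=1,                    # assumed value
)
```

Leaving optional fields such as `distribution` unset models a profile that lacks that information, in which case default handling would apply.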

The cardinality may be used to determine the number of QA pairs that are added to the labeled training set. The other features of the question profile may be used to perform a weighted calculation of the dynamic confidence score threshold value to be used to select QA pairs, such as by way of using a linear combination of the features as a threshold, for example. Thus, logic may be provided that selects a number of QA pairs corresponding to the cardinality of the question profile associated with the unanswered question in the unlabeled set U 120 that have confidence scores equal to or above the dynamically generated confidence score threshold value. In some cases, cardinality may take priority over the confidence score threshold and thus, if there are not enough QA pairs that have a sufficiently high confidence score to equal or exceed the threshold, then the corresponding highest scoring QA pairs may be selected. Alternatively, in other illustrative embodiments, only those that are equal to or above the threshold are available for selection and thus, fewer than the cardinality number of QA pairs may be selected. As noted above, if a cardinality is not specified, then a default number may be utilized.
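The selection logic described above may be sketched, under stated assumptions, as follows. The function names, the feature names and weights, the bias term, and the default cardinality of 5 are hypothetical values chosen only for illustration; the `cardinality_first` flag switches between the two alternative behaviors described, i.e. cardinality taking priority over the threshold versus selecting only pairs that meet the threshold.

```python
DEFAULT_CARDINALITY = 5  # hypothetical default; would be tuned on a development set


def dynamic_threshold(features, weights, bias):
    """Threshold as a linear combination of numeric profile features."""
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())


def select_pairs(scored, threshold, cardinality=None, cardinality_first=True):
    """Pick QA pairs for one question.

    scored: list of (note_id, confidence) tuples for a single question.
    """
    limit = DEFAULT_CARDINALITY if cardinality is None else cardinality
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    above = [pair for pair in ranked if pair[1] >= threshold]
    if cardinality_first and len(above) < limit:
        # Not enough pairs meet the threshold: fall back to the top scorers.
        return ranked[:limit]
    return above[:limit]


scored = [("note_1", 0.95), ("note_2", 0.70), ("note_3", 0.40)]
by_cardinality = select_pairs(scored, threshold=0.9, cardinality=2)
by_threshold = select_pairs(scored, threshold=0.9, cardinality=2,
                            cardinality_first=False)
```

With a threshold of 0.9 and a cardinality of 2, the first call returns the two highest scoring pairs even though only one meets the threshold, whereas the second call returns only the single pair that meets it.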

In addition, the features set forth in the question profiles may be used as a basis for performing clustering operations for clustering similar questions, i.e. questions that are directed to finding a similar answer. For example, unanswered questions in the unlabeled set may be associated with a question profile of a similar question in the labeled training set, and that associated question profile may then be used to cluster the unanswered question with other questions in the unlabeled set that are associated with the same question profile or other question profiles that have similar features. This may utilize a similarity scoring algorithm that scores the similarities of features of question profiles and compares these similarity scores to a threshold, with questions having a high enough similarity score (e.g., equal to or above the threshold) being combined into a similar question group or cluster. For example, the values of features may be used to generate a feature vector and the feature vectors may be compared, such as by using cosine similarity, to generate a semantic similarity score; if the semantic similarity score is equal to or higher than a threshold value, then the questions are considered to be part of the same group or cluster. Based on the clustering performed, the selected QA pairs for a first unanswered question in the cluster may be used to modify the confidence scores of QA pairs for other similar questions in the cluster that include the answer of the selected QA pairs for the first unanswered question. For example, if a first QA pair has question Q1 and answer A1, and a second question Q2 in the same cluster also has a second QA pair with question Q2 and answer A1, the confidence score of that second QA pair may be increased by a weighting factor, which is implementation dependent, since the first QA pair was selected for the first question Q1 for inclusion in the expansion of the labeled training set L 110.
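The clustering and confidence adjustment described above may be illustrated with the following sketch. The greedy assignment of each question to the first sufficiently similar cluster, the multiplicative boost capped at 1.0, the weighting factor of 1.2, and the example feature vectors are assumptions made for illustration, since the weighting factor and clustering algorithm are implementation dependent.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_questions(vectors, threshold):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for question, vector in vectors.items():
        for members in clusters:
            if cosine(vectors[members[0]], vector) >= threshold:
                members.append(question)
                break
        else:
            clusters.append([question])
    return clusters


def boost_cluster_scores(scores, clusters, selected, factor=1.2):
    """Raise the confidence of (Q2, A1) when (Q1, A1) was selected and
    Q1 and Q2 are in the same cluster. Returns a new score mapping."""
    selected_question, answer = selected
    boosted = dict(scores)
    for members in clusters:
        if selected_question not in members:
            continue
        for question in members:
            key = (question, answer)
            if question != selected_question and key in boosted:
                boosted[key] = min(1.0, boosted[key] * factor)
    return boosted


vectors = {"Q1": [1.0, 0.0, 1.0], "Q2": [1.0, 0.0, 0.9], "Q3": [0.0, 1.0, 0.0]}
clusters = cluster_questions(vectors, threshold=0.9)
scores = {("Q1", "A1"): 0.95, ("Q2", "A1"): 0.5, ("Q3", "A1"): 0.5}
boosted = boost_cluster_scores(scores, clusters, selected=("Q1", "A1"))
```

Here Q1 and Q2 fall into the same cluster, so the confidence score of the pair (Q2, A1) is increased once (Q1, A1) has been selected, while the score for Q3, which lies in a different cluster, is left unchanged.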

Thus, with the mechanisms of the illustrative embodiments, given a relatively small labeled training set 110 having a set of ground truth question and answer pairs associated with corresponding question profiles, and a relatively larger unlabeled set U 120 of unanswered questions, an iterative self-training process is performed that alleviates the need for significant human effort to generate large labeled training sets for training the model to be used by a cognitive system when performing cognitive operations. The self-training process trains the machine learning model based on the labeled training set 110, which is iteratively expanded automatically by correlating unanswered questions with every possible answer in a patient's EMR, e.g., every clinical note in the patient EMR, to generate a plurality of QA pairs for each unanswered question and then selecting those that have the highest confidence of being a correct answer for the unanswered question. That is, the model as trained to that point in time is used to generate confidence scores for the QA pairs. The confidence scores of QA pairs may be modified based on clustering with other questions and QA pairs selected for other similar unanswered questions that have been added to the labeled training set L 110 in a previous iteration. A question profile of a question in the labeled training set L 110 is associated with the unanswered question based on a similarity analysis and, based on the characteristics of the associated question profile, dynamic selection of QA pairs based on their confidence scores is performed, with the selected QA pairs being added to the labeled training set L 110. The process may then be iteratively repeated until either all of the unanswered questions in the unlabeled set 120 are added to the labeled training set L 110, or the self-trained machine learning model converges, i.e., the difference between one iteration of training and another is less than a threshold difference.

From the above, it can be appreciated that the illustrative embodiments may be utilized in many different types of data processing environments. In order to provide a context for the description of the specific elements and functionality of the illustrative embodiments, FIGS. 3-5 are provided hereafter as example environments in which aspects of the illustrative embodiments may be implemented. It should be appreciated that FIGS. 3-5 are only examples and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments of the present invention may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.

FIGS. 3-5 are directed to describing an example cognitive system for healthcare applications (also referred to herein as a “healthcare cognitive system”) which implements a request processing pipeline, such as a Question Answering (QA) pipeline (also referred to as a Question/Answer pipeline or Question and Answer pipeline) for example, request processing methodology, and request processing computer program product with which the mechanisms of the illustrative embodiments are implemented. These requests may be provided as structured or unstructured request messages, natural language questions, or any other suitable format for requesting an operation to be performed by the healthcare cognitive system. As described in more detail hereafter, the particular healthcare application that is implemented in the cognitive system of the present invention is a healthcare application for obtaining answers to natural language questions posed by medical personnel which may be answered based on a corpus of medical knowledge documents and/or patient electronic medical records (EMRs). Thus, in one illustrative embodiment, medical personnel may ask natural language questions about a particular patient, and the healthcare cognitive system will provide answers to those questions based on its analysis of the patient's EMRs using a trained model which is trained using a self-training process in accordance with the illustrative embodiments.

It should be appreciated that the healthcare cognitive system, while shown as having a single request processing pipeline in the examples hereafter, may in fact have multiple request processing pipelines. Each request processing pipeline may be separately trained and/or configured to process requests associated with different domains or be configured to perform the same or different analysis on input requests (or questions in implementations using a QA pipeline), depending on the desired implementation. For example, in some cases, a first request processing pipeline may be trained to operate on input requests directed to a first medical malady domain (e.g., various types of blood diseases) while another request processing pipeline may be trained to answer input requests in another medical malady domain (e.g., various types of cancers). In other cases, for example, the request processing pipelines may be configured to provide different types of cognitive functions or support different types of healthcare applications, such as one request processing pipeline being used for patient diagnosis, another request processing pipeline being configured for medical treatment recommendation, another request processing pipeline being configured for patient monitoring, etc.

Moreover, each request processing pipeline may have its own associated corpus or corpora that it ingests and operates on, e.g., one corpus for blood disease domain documents and another corpus for cancer diagnostics domain related documents in the above examples. In some cases, the request processing pipelines may each operate on the same domain of input questions but may have different configurations, e.g., different annotators or differently trained annotators, such that different analysis and potential answers are generated. The healthcare cognitive system may provide additional logic for routing input questions to the appropriate request processing pipeline, such as based on a determined domain of the input request, combining and evaluating final results generated by the processing performed by multiple request processing pipelines, and other control and interaction logic that facilitates the utilization of multiple request processing pipelines.

As noted above, one type of request processing pipeline with which the mechanisms of the illustrative embodiments may be utilized is a Question Answering (QA) pipeline. The description of example embodiments of the present invention hereafter will utilize a QA pipeline as an example of a request processing pipeline that may be augmented to include mechanisms in accordance with one or more illustrative embodiments. It should be appreciated that while the present invention will be described in the context of the cognitive system implementing one or more QA pipelines that operate on an input question, the illustrative embodiments are not limited to such. Rather, the mechanisms of the illustrative embodiments may operate on requests that are not posed as “questions” but are formatted as requests for the cognitive system to perform cognitive operations on a specified set of input data using the associated corpus or corpora and the specific configuration information used to configure the cognitive system. For example, rather than asking a natural language question of “What diagnosis applies to patient P?”, the cognitive system may instead receive a request of “generate diagnosis for patient P,” or the like. It should be appreciated that the mechanisms of the QA system pipeline may operate on requests in a similar manner to that of input natural language questions with minor modifications. In fact, in some cases, a request may be converted to a natural language question for processing by the QA system pipelines if desired for the particular implementation.

As will be discussed in greater detail hereafter, the illustrative embodiments may be integrated in, augment, and extend the functionality of these QA pipeline, or request processing pipeline, mechanisms of a healthcare cognitive system with regard to the training of the models implemented by the pipeline/cognitive system to analyze input natural language questions and/or evaluate the corpus or corpora of documentation used to answer such natural language questions. In particular, the illustrative embodiments provide mechanisms for automatically expanding a relatively small initial set of ground truth questions and answers used to train a machine learning model, by iteratively performing confidence scoring of a larger set of unanswered questions with regard to a set of training data, e.g., one or more patient EMRs, and utilizing question profiles to determine the selection of question and answer (QA) pairs to be added to the ground truth question and answers, which then results in another iteration of training of the machine learning model and expansion of the ground truth set of questions and answers.

Since the illustrative embodiments are described in the context of a request processing pipeline, which is considered to be a QA pipeline in the description herein for purposes of illustration, it is important to first have an understanding of how cognitive systems and question and answer creation in a cognitive system implementing a QA pipeline are implemented before describing how the mechanisms of the illustrative embodiments are integrated in and augment such cognitive systems and request processing pipeline, or QA pipeline, mechanisms. It should be appreciated that the mechanisms described in FIGS. 3-5 are only examples and are not intended to state or imply any limitation with regard to the type of cognitive system mechanisms with which the illustrative embodiments are implemented. Many modifications to the example cognitive system shown in FIGS. 3-5 may be implemented in various embodiments of the present invention without departing from the spirit and scope of the present invention.

As an overview, a cognitive system is a specialized computer system, or set of computer systems, configured with hardware and/or software logic (in combination with hardware logic upon which the software executes) to emulate human cognitive functions. These cognitive systems apply human-like characteristics to conveying and manipulating ideas which, when combined with the inherent strengths of digital computing, can solve problems with high accuracy and resilience on a large scale. A cognitive system performs one or more computer-implemented cognitive operations that approximate a human thought process as well as enable people and machines to interact in a more natural manner so as to extend and magnify human expertise and cognition. A cognitive system comprises artificial intelligence logic, such as natural language processing (NLP) based logic, for example, and machine learning logic, which may be provided as specialized hardware, software executed on hardware, or any combination of specialized hardware and software executed on hardware. The logic of the cognitive system implements the cognitive operation(s), examples of which include, but are not limited to, question answering, identification of related concepts within different portions of content in a corpus, intelligent search algorithms, such as Internet web page searches, for example, medical diagnostic and treatment recommendations, and other types of recommendation generation, e.g., items of interest to a particular user, potential new contact recommendations, or the like.

IBM Watson™ is an example of one such cognitive system which can process human readable language and identify inferences between text passages with human-like high accuracy at speeds far faster than human beings and on a larger scale. In general, such cognitive systems are able to perform the following functions:

- Navigate the complexities of human language and understanding
- Ingest and process vast amounts of structured and unstructured data
- Generate and evaluate hypotheses
- Weigh and evaluate responses that are based only on relevant evidence
- Provide situation-specific advice, insights, and guidance
- Improve knowledge and learn with each iteration and interaction through machine learning processes
- Enable decision making at the point of impact (contextual guidance)
- Scale in proportion to the task
- Extend and magnify human expertise and cognition
- Identify resonating, human-like attributes and traits from natural language
- Deduce various language specific or agnostic attributes from natural language
- High degree of relevant recollection from data points (images, text, voice) (memorization and recall)
- Predict and sense with situational awareness that mimic human cognition based on experiences
- Answer questions based on natural language and specific evidence

In one aspect, cognitive systems provide mechanisms for answering questions posed to these cognitive systems using a Question Answering pipeline or system (QA system) and/or process requests which may or may not be posed as natural language questions. The QA pipeline or system is an artificial intelligence application executing on data processing hardware that answers questions pertaining to a given subject-matter domain presented in natural language. The QA pipeline receives inputs from various sources including input over a network, a corpus of electronic documents or other data, data from a content creator, information from one or more content users, and other such inputs from other possible sources of input. Data storage devices store the corpus of data. A content creator creates content in a document for use as part of a corpus of data with the QA pipeline. The document may include any file, text, article, or source of data for use in the QA system. For example, a QA pipeline accesses a body of knowledge about the domain, or subject matter area, e.g., financial domain, medical domain, legal domain, etc., where the body of knowledge (knowledgebase) can be organized in a variety of configurations, e.g., a structured repository of domain-specific information, such as ontologies, or unstructured data related to the domain, or a collection of natural language documents about the domain.

Content users input questions to the cognitive system which implements the QA pipeline. The QA pipeline then answers the input questions using the content in the corpus of data by evaluating documents, sections of documents, portions of data in the corpus, or the like. When a process evaluates a given section of a document for semantic content, the process can use a variety of conventions to query such document from the QA pipeline, e.g., sending the query to the QA pipeline as a well-formed question which is then interpreted by the QA pipeline and a response is provided containing one or more answers to the question. Semantic content is content based on the relation between signifiers, such as words, phrases, signs, and symbols, and what they stand for, their denotation, or connotation. In other words, semantic content is content that interprets an expression, such as by using Natural Language Processing.

As will be described in greater detail hereafter, the QA pipeline receives an input question, parses the question to extract the major features of the question, uses the extracted features to formulate queries, and then applies those queries to the corpus of data. Based on the application of the queries to the corpus of data, the QA pipeline generates a set of hypotheses, or candidate answers to the input question, by looking across the corpus of data for portions of the corpus of data that have some potential for containing a valuable response to the input question. The QA pipeline then performs deep analysis on the language of the input question and the language used in each of the portions of the corpus of data found during the application of the queries using a variety of reasoning algorithms. There may be hundreds or even thousands of reasoning algorithms applied, each of which performs different analysis, e.g., comparisons, natural language analysis, lexical analysis, or the like, and generates a score. For example, some reasoning algorithms may look at the matching of terms and synonyms within the language of the input question and the found portions of the corpus of data. Other reasoning algorithms may look at temporal or spatial features in the language, while others may evaluate the source of the portion of the corpus of data and evaluate its veracity.

The scores obtained from the various reasoning algorithms indicate theextent to which the potential response is inferred by the input questionbased on the specific area of focus of that reasoning algorithm. Eachresulting score is then weighted against a statistical model. Thestatistical model captures how well the reasoning algorithm performed atestablishing the inference between two similar passages for a particulardomain during the training period of the QA pipeline. The statisticalmodel is used to summarize a level of confidence that the QA pipelinehas regarding the evidence that the potential response, i.e. candidateanswer, is inferred by the question. This process is repeated for eachof the candidate answers until the QA pipeline identifies candidateanswers that surface as being significantly stronger than others andthus, generates a final answer, or ranked set of answers, for the inputquestion.
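The weighting of reasoning algorithm scores against a statistical model can be illustrated with a minimal Python sketch. The logistic combination, the function names `confidence` and `rank_candidates`, and the example weights are assumptions made for illustration only; the description above does not specify the form of the statistical model.

```python
import math

def confidence(scores, weights, bias=0.0):
    """Combine per-algorithm scores into one confidence value in (0, 1).

    Assumption: a logistic model stands in for the statistical model;
    `weights` would be learned during the training period.
    """
    z = bias + sum(weights[name] * value for name, value in scores.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates, weights, bias=0.0):
    """Return (answer, confidence) tuples, most confident first."""
    scored = [(answer, confidence(s, weights, bias))
              for answer, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores from two reasoning algorithms for two candidates.
weights = {"term_match": 2.0, "temporal": 1.0}
candidates = {
    "answer A": {"term_match": 0.9, "temporal": 0.5},
    "answer B": {"term_match": 0.1, "temporal": 0.2},
}
ranking = rank_candidates(candidates, weights)
```

Here `ranking` lists the candidate answers from strongest to weakest, from which either a final answer or a ranked set of answers may be taken.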

As mentioned above, QA pipeline mechanisms operate by accessinginformation from a corpus of data or information (also referred to as acorpus of content), analyzing it, and then generating answer resultsbased on the analysis of this data. Accessing information from a corpusof data typically includes: a database query that answers questionsabout what is in a collection of structured records, and a search thatdelivers a collection of document links in response to a query against acollection of unstructured data (text, markup language, etc.).Conventional question answering systems are capable of generatinganswers based on the corpus of data and the input question, verifyinganswers to a collection of questions for the corpus of data, correctingerrors in digital text using a corpus of data, and selecting answers toquestions from a pool of potential answers, i.e. candidate answers.

Content creators, such as article authors, electronic document creators,web page authors, document database creators, and the like, determineuse cases for products, solutions, and services described in suchcontent before writing their content. Consequently, the content creatorsknow what questions the content is intended to answer in a particulartopic addressed by the content. Categorizing the questions, such as interms of roles, type of information, tasks, or the like, associated withthe question, in each document of a corpus of data allows the QApipeline to more quickly and efficiently identify documents containingcontent related to a specific query. The content may also answer otherquestions that the content creator did not contemplate that may beuseful to content users. The questions and answers may be verified bythe content creator to be contained in the content for a given document.These capabilities contribute to improved accuracy, system performance,machine learning, and confidence of the QA pipeline. Content creators,automated tools, or the like, annotate or otherwise generate metadatafor providing information useable by the QA pipeline to identify thesequestion and answer attributes of the content.

Operating on such content, the QA pipeline generates answers for inputquestions using a plurality of intensive analysis mechanisms whichevaluate the content to identify the most probable answers, i.e.candidate answers, for the input question. The most probable answers areoutput as a ranked listing of candidate answers ranked according totheir relative scores or confidence measures calculated duringevaluation of the candidate answers, as a single final answer having ahighest ranking score or confidence measure, or which is a best match tothe input question, or a combination of ranked listing and final answer.

FIG. 3 depicts a schematic diagram of one illustrative embodiment of acognitive system 300 implementing a request processing pipeline 308,which in some embodiments may be a question answering (QA) pipeline, ina computer network 302. For purposes of the present description, it willbe assumed that the request processing pipeline 308 is implemented as aQA pipeline that operates on structured and/or unstructured requests inthe form of input questions. One example of a question processingoperation which may be used in conjunction with the principles describedherein is described in U.S. Patent Application Publication No.2011/0125734, which is herein incorporated by reference in its entirety.The cognitive system 300 is implemented on one or more computing devices304, 305, 310, and 312 (comprising one or more processors and one ormore memories, and potentially any other computing device elementsgenerally known in the art including buses, storage devices,communication interfaces, and the like) connected to the computernetwork 302. The network 302 includes multiple computing devices 304,305, 310, and 312 in communication with each other and with otherdevices or components via one or more wired and/or wireless datacommunication links, where each communication link comprises one or moreof wires, routers, switches, transmitters, receivers, or the like. Thecognitive system 300, which may be implemented in a server computingdevice such as server 305, and network 302 enable question processingand answer generation (QA) functionality for one or more cognitivesystem users via their respective computing devices 310-312. Otherembodiments of the cognitive system 300 may be used with components,systems, sub-systems, and/or devices other than those that are depictedherein.

The cognitive system 300 is configured to implement a QA pipeline 308 that receives inputs from various sources. For example, the cognitive system 300 receives input from the network 302, one or more corpora of electronic documents 306 and/or 340, cognitive system users, and/or other data and other possible sources of input. In one embodiment, some or all of the inputs to the cognitive system 300 are routed through the network 302. The various computing devices 304 on the network 302 include access points for content creators and QA system users. Some of the computing devices 304 include devices for a database storing the corpus of data 306 (which is shown as a separate entity in FIG. 3 for illustrative purposes only). Portions of the corpus of data 306 may also be provided on one or more other network attached storage devices, in one or more databases, or other computing devices not explicitly shown in FIG. 3. The network 302 includes local network connections and remote connections in various embodiments, such that the cognitive system 300 may operate in environments of any size, including local and global, e.g., the Internet.

In one embodiment, the content creator creates content in a document ofthe corpus of data 306 for use as part of a corpus of data with thecognitive system 300. The document includes any file, text, article, orsource of data for use in the cognitive system 300. QA system usersaccess the cognitive system 300 via a network connection or an Internetconnection to the network 302, and input questions to the cognitivesystem 300 that are answered by the content in the corpus of data 306.In one embodiment, the questions are formed using natural language. Thecognitive system 300 parses and interprets the question via a QApipeline 308, and provides a response to the cognitive system user,e.g., cognitive system user 310, containing one or more answers to thequestion. In some embodiments, the cognitive system 300 provides aresponse to users in a ranked list of candidate answers while in otherillustrative embodiments, the cognitive system 300 provides a singlefinal answer or a combination of a final answer and ranked listing ofother candidate answers.

The cognitive system 300 implements the QA pipeline 308 which comprises a plurality of stages for processing an input question and the corpus of data 306. The QA pipeline 308 generates answers for the input question based on the processing of the input question and the corpus of data 306.

In some illustrative embodiments, the cognitive system 300 may be theIBM Watson™ cognitive system available from International BusinessMachines Corporation of Armonk, N.Y., which is augmented with themechanisms of the illustrative embodiments described hereafter. Asoutlined previously, a QA pipeline of the IBM Watson™ cognitive systemreceives an input question which it then parses to extract the majorfeatures of the question, which in turn are then used to formulatequeries that are applied to the corpus of data. Based on the applicationof the queries to the corpus of data, a set of hypotheses, or candidateanswers to the input question, are generated by looking across thecorpus of data for portions of the corpus of data that have somepotential for containing a valuable response to the input question. TheQA pipeline of the IBM Watson™ cognitive system then performs deepanalysis on the language of the input question and the language used ineach of the portions of the corpus of data found during the applicationof the queries using a variety of reasoning algorithms.

The scores obtained from the various reasoning algorithms are then weighted against a statistical model that summarizes a level of confidence that the QA pipeline of the IBM Watson™ cognitive system has regarding the evidence that the potential response, i.e. candidate answer, is inferred by the question. This process is repeated for each of the candidate answers to generate a ranked listing of candidate answers which may then be presented to the user that submitted the input question, or from which a final answer is selected and presented to the user. More information about the QA pipeline of the IBM Watson™ cognitive system may be obtained, for example, from the IBM Corporation website, IBM Redbooks, and the like. For example, information about the QA pipeline of the IBM Watson™ cognitive system can be found in Yuan et al., “Watson and Healthcare,” IBM developerWorks, 2011 and “The Era of Cognitive Systems: An Inside Look at IBM Watson and How it Works” by Rob High, IBM Redbooks, 2012.

As noted above, while the input to the cognitive system 300 from a client device may be posed in the form of a natural language question, the illustrative embodiments are not limited to such. Rather, the input question may in fact be formatted or structured as any suitable type of request which may be parsed and analyzed using structured and/or unstructured input analysis, including but not limited to the natural language parsing and analysis mechanisms of a cognitive system such as IBM Watson™, to determine the basis upon which to perform cognitive analysis and provide a result of the cognitive analysis. In the case of a healthcare based cognitive system, this analysis may involve processing patient medical records, medical guidance documentation from one or more corpora, and the like, to provide a healthcare oriented cognitive system result.

In the context of the present invention, cognitive system 300 mayprovide a cognitive functionality for assisting with healthcare basedoperations. For example, depending upon the particular implementation,the healthcare based operations may comprise patient diagnostics,medical treatment recommendation systems, medical practice managementsystems, personal patient care plan generation and monitoring, patientelectronic medical record (EMR) evaluation for various purposes, such asfor identifying patients that are suitable for a medical trial or aparticular type of medical treatment, or the like. Thus, the cognitivesystem 300 may be a healthcare cognitive system 300 that operates in themedical or healthcare type domains and which may process requests forsuch healthcare operations via the request processing pipeline 308 inputas either structured or unstructured requests, natural language inputquestions, or the like. In one illustrative embodiment, the cognitivesystem 300 is a patient EMR question and answer system which providesfunctionality for answering questions posed by patients and/or medicalpersonnel with regard to a patient's medical condition and the contentsof the patient EMR. Such a QA system may further provide treatmentrecommendation functionality for recommending treatments for the patientbased on an analysis of the patient's medical conditions and informationavailable in one or more medical corpora having documents directed tothe treatment of various medical conditions, e.g., medical guidelinedocuments, medical reference documents, pharmaceutical documentation,medical procedure documentation, and the like.

As shown in FIG. 3, the cognitive system 300 is further augmented, inaccordance with the mechanisms of the illustrative embodiments, toinclude logic implemented in specialized hardware, software executed onhardware, or any combination of specialized hardware and softwareexecuted on hardware, for implementing a question profile basedself-training engine 320. The question profile based self-trainingengine 320 provides the logic for performing self-training of one ormore machine learning models 350 that implement the logic for analyzingnatural language questions, identifying candidate answers to thosenatural language questions, scoring the candidate answers, and/orselecting a set of one or more final answers for return to an originatorof a natural language question as the answer(s) to the natural languagequestion.

As shown in FIG. 3, the question profile based self-training engine 320(hereafter referred to as the “self-training engine 320”) comprises alabeled training set (L) 322, self-training model logic 324, unlabeledset (U) 326, question profiles 328, question profile generator 329, andlabeled training set expansion logic 330. The labeled training set (L)322 comprises a plurality of question and answer (QA) pairs thatrepresent the ground truth for training the model 350. The ground truthindicates a set of questions for which the correct answer is known to betrue and is associated with the question in the labeled training set (L)322. In accordance with the mechanisms of the illustrative embodiments,the labeled training set (L) 322 is initially a relatively small set ofQA pairs, e.g., 100 QA pairs, which is then expanded iteratively throughthe automated mechanisms for selecting QA pairs from the unlabeled set(U) 326.

The labeled training set (L) 322 may initially be generated by one or more subject matter experts (SMEs) that provide the questions and their known correct answers. In addition, the SMEs may provide a question profile for each of the questions in the labeled training set (L) 322, which is stored in the question profiles data structures 328. Alternatively, the questions in the labeled training set (L) 322 may be analyzed, such as by question profile generator 329, using natural language processing techniques which extract the features or characteristics of the question and its corresponding answer and populate the question profile fields automatically based on the results of such natural language processing. The question profile generator 329 may comprise its own natural language processing logic or may utilize natural language processing logic present in the cognitive system 300 to perform such natural language processing and extraction of features/characteristics from the questions and answers.

The unlabeled set (U) 326 initially comprises a set of unanswered questions that have been generated as candidates for expanding the labeled training set (L) 322. Again, these questions may be natural language questions that are generated by one or more SMEs. As noted above, it is relatively easy for SMEs to generate questions that are representative of the types of questions that a patient and/or medical professional may ask of a QA system. The difficulty is generating correct answers for such questions in an automated manner using the cognitive logic of a QA system. Thus, the creation of the initial unlabeled set (U) 326 does not represent a significant amount of human effort. The unlabeled set (U) 326 may, after initially being provided, be processed by the self-training model logic 324 to associate each unanswered question in the unlabeled set (U) 326 with each portion of a training data set to generate a plurality of question answer (QA) pairs for the unanswered question, the set of QA pairs thereby representing all of the possible sources of answers to the unanswered question. For example, if a training data set has portions 1-1000, then each unanswered question will have 1000 QA pairs that are to be evaluated. In accordance with some illustrative embodiments used as examples herein, the training data set may be part of the corpus 340 and may comprise one or more patient EMRs for one or more patients, and having portions, e.g., clinical notes, that may serve as answers to natural language questions.
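The pairing of each unanswered question with each portion of the training data set amounts to a Cartesian product, which the following Python sketch illustrates. The function name `build_unlabeled_set` and the placeholder questions and notes are hypothetical and are used only for illustration.

```python
from itertools import product

def build_unlabeled_set(questions, portions):
    """Pair every unanswered question with every portion of the training data.

    Each resulting (question, portion) tuple is one candidate QA pair.
    """
    return list(product(questions, portions))

# Hypothetical inputs: 2 unanswered questions, 1000 portions (e.g., clinical notes).
questions = ["What medications is the patient taking?",
             "When was the last surgery?"]
portions = [f"clinical note {i}" for i in range(1, 1001)]

unlabeled = build_unlabeled_set(questions, portions)
pairs_for_first = [pair for pair in unlabeled if pair[0] == questions[0]]
```

With 1000 portions, each unanswered question yields 1000 candidate QA pairs, matching the example above.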

Given the labeled training set (L) 322, the self-training model logic324 performs a first iteration to train the machine learning model 350and thereby generate an initial state of the machine learning model 350.The self-training model logic 324 then applies the trained model 350 tothe unlabeled set (U) 326 comprising the QA pairs, where in the contextof a medical cognitive system, for example, each QA pair comprises anunanswered question that may be posed by a patient or medicalprofessional paired with a clinical note, or portion of a patient EMR,that may represent the answer to the unanswered question. Theapplication of the currently trained model 350 to the unlabeled set (U)326 results in the generation of confidence scores associated with eachof the QA pairings, where the confidence scores indicate a confidence bythe currently trained model 350 that the answer of a QA pair is acorrect answer for the unanswered question of the QA pair. Thisgenerates, for each unanswered question, a set of answers and confidencescores associated with each of the answers, where the answers areclinical notes in, or portions of, the EMR training data set.

The labeled training set expansion logic 330 provides the logic forselecting QA pairs from the unlabeled set (U) 326 for inclusion in thelabeled training set (L) 322 based on their confidence scores and anassociation of question profiles with the unanswered questions. Thelabeled training set expansion logic 330 may also provide logic forperforming clustering of unanswered questions based on their associatedquestion profile features/characteristics and/or other key featuresextracted from the unanswered questions, so as to modify the confidencescores associated with QA pairs based on the selection of QA pairsperformed for other unanswered questions in a same cluster.

For example, the labeled training set expansion logic 330 may associatewith each unanswered question (Q) in the QA pairs of the unlabeled set(U) 326 a question profile from the question profile data structures328, which corresponds to the labeled training set (L) 322. That is, asimilarity evaluation is performed by the labeled training set expansionlogic 330 for determining a similarity between the unanswered questionin the unlabeled set (U) 326 and a question in the labeled training set(L) 322. A similarity score based on a correlation of features may begenerated and a highest scoring similar question in the labeled trainingset (L) 322 is identified for the unanswered question in the unlabeledset (U) 326. The selected similar question in the labeled training set(L) 322 has a corresponding question profile in the question profilesdata structures 328 which is then applied to the unanswered question inthe unlabeled set (U) 326. The associated question profile is used bythe labeled training set expansion logic 330 to dynamically calculate acorresponding confidence score threshold value and/or a number of QApairs, for an unanswered question, to select from the unlabeled set (U)326 for inclusion in the labeled training set (L) 322, if any.
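The association of a question profile with an unanswered question may be sketched as follows. Token-set (Jaccard) overlap is used here only as a stand-in for the similarity score based on a correlation of features, and the names `jaccard` and `assign_profile`, as well as the profile fields, are assumptions for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two questions (assumed similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def assign_profile(unanswered, labeled_profiles):
    """Return the profile of the most similar labeled question, or None.

    labeled_profiles maps each labeled question to its question profile.
    """
    if not labeled_profiles:
        return None
    best = max(labeled_profiles, key=lambda q: jaccard(unanswered, q))
    return labeled_profiles[best]

# Hypothetical labeled questions and profiles.
labeled_profiles = {
    "what medications does the patient take": {"cardinality": 3},
    "when was the last surgery": {"cardinality": 1},
}
profile = assign_profile("what medications is the patient taking",
                         labeled_profiles)
```

In this example the unanswered question inherits the profile of the medication question, since that labeled question shares the most tokens with it.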

That is, based on the question profile associated with an unansweredquestion, a confidence score threshold value is calculated thatindicates the minimum level of confidence needed for a QA pairassociated with the unanswered question, to be considered for inclusionin the labeled training set (L) 322. It should be appreciated that ifsufficient information is not present in the question profile, or if aquestion profile cannot be identified for the unanswered question, thena default threshold confidence score value may be utilized. Moreover,while this threshold value indicates a minimum level of confidence, insome implementations this minimum level of confidence may be supersededby a required number of QA pairs to be selected such that if there arenot a sufficient number of QA pairs that have confidence scores equal toor exceeding the threshold value, some QA pairs that fall below thethreshold value may be selected for inclusion in the labeled trainingset (L) 322.

In addition, based on information in the question profile associatedwith the unanswered question, a number of QA pairs to be selected forinclusion in the labeled training set (L) 322 may be identified. Thisnumber may be determined based on the cardinality indicated in thequestion profile, for example. Alternatively, a plurality offeatures/characteristics specified in the question profile may beevaluated using weighted logic to generate a number of QA pairs to beselected. In general, the number of QA pairs to be selected is combinedwith the minimum confidence score threshold value such that the numberof QA pairs selected comprises a corresponding number of QA pairsselected from those QA pairs that meet or exceed the minimum confidencescore threshold. In some illustrative embodiments, the highest scoringQA pairs are selected first prior to selecting lower scoring QA pairs.
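The combination of a profile-derived confidence threshold with a profile-derived number of QA pairs can be sketched in Python as shown below. The profile field names `min_confidence` and `cardinality`, the default threshold of 0.8, and the function name `select_pairs` are hypothetical; the sketch covers only the variant in which the threshold is a hard minimum.

```python
DEFAULT_THRESHOLD = 0.8  # assumed default; the description gives no value

def select_pairs(scored_answers, profile=None,
                 default_threshold=DEFAULT_THRESHOLD):
    """Select answers for one unanswered question.

    scored_answers: list of (answer, confidence) tuples.
    profile: optional dict; falls back to the default threshold when absent
    or when it lacks a threshold entry.
    """
    profile = profile or {}
    threshold = profile.get("min_confidence", default_threshold)
    limit = profile.get("cardinality")  # None means no limit on the count
    # Keep pairs meeting the threshold, highest scoring first.
    eligible = sorted((p for p in scored_answers if p[1] >= threshold),
                      key=lambda p: p[1], reverse=True)
    return eligible if limit is None else eligible[:limit]

scored = [("note 1", 0.95), ("note 2", 0.85), ("note 3", 0.6)]
with_profile = select_pairs(scored, {"min_confidence": 0.7, "cardinality": 1})
with_default = select_pairs(scored)
```

With the profile, only the single highest scoring pair is selected; without one, both pairs meeting the default threshold are selected.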

Thus, based on the identified dynamic confidence score threshold, or a default threshold, and the confidence scores associated with the QA pairs in the unlabeled set (U) 326 for the question, QA pairs that have confidence scores meeting or exceeding the threshold confidence score value are selected by the labeled training set expansion logic 330. The number of such QA pairs that are selected may be based on the dynamic determination corresponding to the question profile as discussed above. These QA pairs are added to the labeled training set (L) 322 to thereby expand the labeled training set (L) 322 to include additional QA pairs as part of the ground truth. The expanded labeled training set (now referred to as L′) is then used to update the training of the machine learning model 350 using the expanded QA pairs of the labeled training set (L′) 322. This process is repeated until either all of the unanswered questions from the unlabeled set (U) 326 have been added to the labeled training set (L) 322, i.e. at least one QA pair for each unanswered question has been added to the labeled training set (L) 322, or the machine learning model 350 has reached a convergence state, e.g., a change in operational parameters of the machine learning model 350 does not meet or exceed a threshold change level, i.e. no further appreciable change in the operational parameters of the model 350 occurs.
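The iterative expansion described above can be summarized with the following Python sketch. The callables `train` and `score`, the fixed threshold, and the stopping rule are assumptions: in particular, stopping when no new QA pair is selected is used here as a simple stand-in for the convergence test on model parameters, and one best answer is selected per question rather than a profile-dependent number.

```python
def self_train(labeled, unlabeled, train, score, threshold=0.8, max_iters=10):
    """Iteratively expand the labeled set from the unlabeled set.

    labeled:   list of (question, answer) ground-truth pairs
    unlabeled: dict mapping each unanswered question to candidate answers
    train:     callable(labeled) -> model            (hypothetical)
    score:     callable(model, question, answer) -> confidence (hypothetical)
    """
    labeled = list(labeled)
    pending = {q: list(c) for q, c in unlabeled.items()}
    model = train(labeled)
    for _ in range(max_iters):
        if not pending:
            break  # every unanswered question now has a QA pair in L
        added = []
        for q, candidates in pending.items():
            best = max(candidates, key=lambda a: score(model, q, a))
            if score(model, q, best) >= threshold:
                added.append((q, best))
        if not added:
            break  # stand-in for convergence: nothing new to learn from
        labeled.extend(added)
        for q, _ in added:
            del pending[q]
        model = train(labeled)  # retrain on the expanded set L'
    return model, labeled
```

A caller would supply its own training and scoring routines; the loop itself makes no assumption about the kind of model being trained.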

As noted above, one additional functionality performed by the labeled training set expansion logic 330 is to perform clustering of unanswered questions based on features/characteristics present in their associated question profiles. That is, a clustering operation may be performed on unanswered questions in the unlabeled set (U) 326 by applying a similarity analysis and scoring logic that scores the degree of similarity between one or more characteristics/features present in the question profiles. The similarity score may be compared to a threshold similarity score value and if the threshold is met or exceeded, then the two unanswered questions are considered to be sufficiently similar to be clustered together. For instance, a question cluster may contain a set of questions with a high semantic similarity score, i.e. all the questions in the same cluster are essentially asking for the same answer in different ways.
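A threshold-based clustering of this kind may be sketched as follows. The greedy single-pass strategy, the set-overlap similarity, and the names `feature_similarity` and `cluster_questions` are assumptions chosen for brevity; the description above does not prescribe a particular clustering algorithm.

```python
def feature_similarity(p1, p2):
    """Overlap between two sets of profile features (assumed measure)."""
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 0.0

def cluster_questions(profiles, threshold=0.5):
    """Greedy clustering: a question joins the first cluster whose first
    member is similar enough; otherwise it starts a new cluster.

    profiles maps each unanswered question to a set of profile features.
    """
    clusters = []
    for question, features in profiles.items():
        for cluster in clusters:
            seed = cluster[0]
            if feature_similarity(features, profiles[seed]) >= threshold:
                cluster.append(question)
                break
        else:
            clusters.append([question])
    return clusters

# Hypothetical profile features for three unanswered questions.
profiles = {
    "q_a": {"medication", "list"},
    "q_b": {"medication", "list", "current"},
    "q_c": {"surgery", "date"},
}
clusters = cluster_questions(profiles)
```

Here the two medication questions fall into one cluster, while the surgery question forms its own.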

Based on the clustering, QA pairs selected for inclusion in the labeledtraining set (L) 322 during a previous iteration of the self-trainingoperation may be used as a basis for modifying confidence scores forsimilar QA pairs of other unanswered questions in the same cluster.Thus, greater confidence is given to QA pairs having similar answers tothat in QA pairs already added to the labeled training set (L) 322 inprevious iterations. In this way, different questions that are askingfor similar answers are added to the labeled training set (L) 322 tothereby cover various possible ways in which patients and/or medicalprofessionals are likely to ask for particular answers.
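One simple way to give greater confidence to such QA pairs is sketched below. The additive boost, its size, the exact-match test used in place of an answer similarity measure, and the function name `boost_confidence` are all assumptions for illustration.

```python
def boost_confidence(scored_answers, accepted_answers, boost=0.25):
    """Raise the confidence of candidate answers that were already accepted
    for another question in the same cluster, capped at 1.0.

    scored_answers:   list of (answer, confidence) for one unanswered question
    accepted_answers: set of answers added to L for other cluster members
    """
    boosted = []
    for answer, conf in scored_answers:
        if answer in accepted_answers:  # assumed: exact match stands in for similarity
            conf = min(1.0, conf + boost)
        boosted.append((answer, conf))
    return boosted

scored = [("note 1", 0.5), ("note 2", 0.5), ("note 3", 0.9)]
boosted = boost_confidence(scored, {"note 1", "note 3"})
```

After boosting, "note 1" overtakes "note 2" even though the model scored them equally, reflecting the evidence from the rest of the cluster.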

As mentioned above, this process is iteratively repeated with eachiteration potentially resulting in a larger expanded labeled trainingset (L) 322 which is expanded automatically based on the operation ofthe question profile based self-training engine 320. The resultingtrained model(s) 350 may then be output or otherwise provided to thecognitive system 300 which implements these model(s) 350 in theoperation of the request processing pipeline 308. The model(s) 350 maybe used to analyze the input questions/requests, generate candidateanswers/results for the input questions/requests, score the candidateanswers/results, and/or select one or more final answers/results to bereturned to an originator of the input question/request, e.g., a user ofclient computing device 310 or 312.

As is evident from the above, the mechanisms of the illustrativeembodiments are rooted in the computer technology arts and areimplemented using logic present in such computing or data processingsystems. These computing or data processing systems are specificallyconfigured, either through hardware, software, or a combination ofhardware and software, to implement the various operations describedabove. As such, FIG. 4 is provided as an example of one type of dataprocessing system in which aspects of the present invention may beimplemented. Many other types of data processing systems may be likewiseconfigured to specifically implement the mechanisms of the illustrativeembodiments.

FIG. 4 is a block diagram of an example data processing system in which aspects of the illustrative embodiments are implemented. Data processing system 400 is an example of a computer, such as server 305 or client 310 in FIG. 3, in which computer usable code or instructions implementing the processes for illustrative embodiments of the present invention are located. In one illustrative embodiment, FIG. 4 represents a server computing device, such as a server 305, which implements a cognitive system 300 and QA system pipeline 308 augmented to include the additional mechanisms of the illustrative embodiments described hereafter.

In the depicted example, data processing system 400 employs a hub architecture including North Bridge and Memory Controller Hub (NB/MCH) 402 and South Bridge and Input/Output (I/O) Controller Hub (SB/ICH) 404. Processing unit 406, main memory 408, and graphics processor 410 are connected to NB/MCH 402. Graphics processor 410 is connected to NB/MCH 402 through an accelerated graphics port (AGP).

In the depicted example, local area network (LAN) adapter 412 connects to SB/ICH 404. Audio adapter 416, keyboard and mouse adapter 420, modem 422, read only memory (ROM) 424, hard disk drive (HDD) 426, CD-ROM drive 430, universal serial bus (USB) ports and other communication ports 432, and PCI/PCIe devices 434 connect to SB/ICH 404 through bus 438 and bus 440. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 424 may be, for example, a flash basic input/output system (BIOS).

HDD 426 and CD-ROM drive 430 connect to SB/ICH 404 through bus 440. HDD 426 and CD-ROM drive 430 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. Super I/O (SIO) device 436 is connected to SB/ICH 404.

An operating system runs on processing unit 406. The operating system coordinates and provides control of various components within the data processing system 400 in FIG. 4. As a client, the operating system is a commercially available operating system such as Microsoft® Windows 10®. An object-oriented programming system, such as the Java™ programming system, may run in conjunction with the operating system and provide calls to the operating system from Java™ programs or applications executing on data processing system 400.

As a server, data processing system 400 may be, for example, an IBM® eServer™ System p® computer system, running the Advanced Interactive Executive (AIX®) operating system or the LINUX® operating system. Data processing system 400 may be a symmetric multiprocessor (SMP) system including a plurality of processors in processing unit 406. Alternatively, a single processor system may be employed.

Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as HDD 426, and are loaded into main memory 408 for execution by processing unit 406. The processes for illustrative embodiments of the present invention are performed by processing unit 406 using computer usable program code, which is located in a memory such as, for example, main memory 408, ROM 424, or in one or more peripheral devices 426 and 430, for example.

A bus system, such as bus 438 or bus 440 as shown in FIG. 4, is comprised of one or more buses. Of course, the bus system may be implemented using any type of communication fabric or architecture that provides for a transfer of data between different components or devices attached to the fabric or architecture. A communication unit, such as modem 422 or network adapter 412 of FIG. 4, includes one or more devices used to transmit and receive data. A memory may be, for example, main memory 408, ROM 424, or a cache such as found in NB/MCH 402 in FIG. 4.

Those of ordinary skill in the art will appreciate that the hardware depicted in FIGS. 3 and 4 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIGS. 3 and 4. Also, the processes of the illustrative embodiments may be applied to a multiprocessor data processing system, other than the SMP system mentioned previously, without departing from the spirit and scope of the present invention.

Moreover, the data processing system 400 may take the form of any of a number of different data processing systems including client computing devices, server computing devices, a tablet computer, laptop computer, telephone or other communication device, a personal digital assistant (PDA), or the like. In some illustrative examples, data processing system 400 may be a portable computing device that is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data, for example. Essentially, data processing system 400 may be any known or later developed data processing system without architectural limitation.

FIG. 5 is an example diagram illustrating an interaction of elements ofa healthcare cognitive system in accordance with one illustrativeembodiment. The example diagram of FIG. 5 depicts an implementation of ahealthcare cognitive system 500 that is configured to provide medicaltreatment recommendations for patients. However, it should beappreciated that this is only an example implementation and otherhealthcare operations may be implemented in other embodiments of thehealthcare cognitive system 500 without departing from the spirit andscope of the present invention.

Moreover, it should be appreciated that while FIG. 5 depicts the patient 502 and user 506 as human figures, the interactions with and between these entities may be performed using computing devices, medical equipment, and/or the like, such that entities 502 and 506 may in fact be computing devices, e.g., client computing devices. For example, the interactions 504, 514, 516, and 530 between the patient 502 and the user 506 may be performed orally, e.g., a doctor interviewing a patient, and may involve the use of one or more medical instruments, monitoring devices, or the like, to collect information that may be input to the healthcare cognitive system 500 as patient attributes 518. Interactions between the user 506 and the healthcare cognitive system 500 will be electronic via a user computing device (not shown), such as a client computing device 310 or 312 in FIG. 3, communicating with the healthcare cognitive system 500 via one or more data communication links and potentially one or more data networks.

As shown in FIG. 5, in accordance with one illustrative embodiment, a patient 502 presents symptoms 504 of a medical malady or condition to a user 506, such as a healthcare practitioner, technician, or the like. The user 506 may interact with the patient 502 via a question 514 and response 516 exchange where the user gathers more information about the patient 502, the symptoms 504, and the medical malady or condition of the patient 502. It should be appreciated that the questions/responses may in fact also represent the user 506 gathering information from the patient 502 using various medical equipment, e.g., blood pressure monitors, thermometers, wearable health and activity monitoring devices associated with the patient such as a FitBit™, a wearable heart monitor, or any other medical equipment that may monitor one or more medical characteristics of the patient 502. In some cases, such medical equipment may be medical equipment typically used in hospitals or medical centers to monitor vital signs and medical conditions of patients that are present in hospital beds for observation or medical treatment.

In response, the user 506 submits a request 508 to the healthcare cognitive system 500, such as via a user interface on a client computing device that is configured to allow users to submit requests to the healthcare cognitive system 500 in a format that the healthcare cognitive system 500 can parse and process. The request 508 may include, or be accompanied with, information identifying patient attributes 518. These patient attributes 518 may include, for example, an identifier of the patient 502 from which patient EMRs 522 for the patient may be retrieved, demographic information about the patient, the symptoms 504, and other pertinent information obtained from the responses 516 to the questions 514 or information obtained from medical equipment used to monitor or gather data about the condition of the patient 502. Any information about the patient 502 that may be relevant to a cognitive evaluation of the patient by the healthcare cognitive system 500 may be included in the request 508 and/or patient attributes 518.

The healthcare cognitive system 500 provides a cognitive system that is specifically configured to perform an implementation specific healthcare oriented cognitive operation. In the depicted example, this healthcare oriented cognitive operation is directed to providing a treatment recommendation 528 to the user 506 to assist the user 506 in treating the patient 502 based on their reported symptoms 504 and other information gathered about the patient 502 via the question 514 and response 516 process and/or medical equipment monitoring/data gathering. The healthcare cognitive system 500 operates on the request 508 and patient attributes 518 utilizing information gathered from the medical corpus and other source data 526, treatment guidance data 524, and the patient EMRs 522 associated with the patient 502 to generate one or more treatment recommendations 528. The treatment recommendations 528 may be presented in a ranked ordering with associated supporting evidence, obtained from the patient attributes 518 and data sources 522-526, indicating the reasoning as to why each treatment recommendation 528 is being provided and why it is ranked in the manner that it is ranked.

For example, based on the request 508 and the patient attributes 518, the healthcare cognitive system 500 may operate on the request, such as by using a QA pipeline type processing as described herein, to parse the request 508 and patient attributes 518 to determine what is being requested and the criteria upon which the request is to be generated as identified by the patient attributes 518, and may perform various operations for generating queries that are sent to the data sources 522-526 to retrieve data, generate candidate treatment recommendations (or answers to the input question), and score these candidate treatment recommendations based on supporting evidence found in the data sources 522-526. In the depicted example, the patient EMRs 522 is a patient information repository that collects patient data from a variety of sources, e.g., hospitals, laboratories, physicians' offices, health insurance companies, pharmacies, etc. The patient EMRs 522 store various information about individual patients, such as patient 502, in a manner (structured, unstructured, or a mix of structured and unstructured formats) that the information may be retrieved and processed by the healthcare cognitive system 500. This patient information may comprise various demographic information about patients, personal contact information about patients, employment information, health insurance information, laboratory reports, physician reports from office visits, hospital charts, historical information regarding previous diagnoses, symptoms, treatments, prescription information, etc. Based on an identifier of the patient 502, the patient's corresponding EMRs 522 from this patient repository may be retrieved by the healthcare cognitive system 500 and searched/processed to generate treatment recommendations 528.

The treatment guidance data 524 provides a knowledge base of medical knowledge that is used to identify potential treatments for a patient based on the patient's attributes 518 and historical information presented in the patient's EMRs 522. This treatment guidance data 524 may be obtained from official treatment guidelines and policies issued by medical authorities, e.g., the American Medical Association, may be obtained from widely accepted physician medical and reference texts, e.g., the Physician's Desk Reference, insurance company guidelines, or the like. The treatment guidance data 524 may be provided in any suitable form that may be ingested by the healthcare cognitive system 500 including both structured and unstructured formats.

In some cases, such treatment guidance data 524 may be provided in the form of rules that indicate the criteria required to be present, and/or required not to be present, for the corresponding treatment to be applicable to a particular patient for treating a particular symptom or medical malady/condition. For example, the treatment guidance data 524 may comprise a treatment recommendation rule that indicates that for a treatment of Decitabine, the strict criteria for the use of such a treatment are that the patient 502 is less than or equal to 60 years of age, has acute myeloid leukemia (AML), and has no evidence of cardiac disease. Thus, for a patient 502 that is 59 years of age, has AML, and does not have any evidence in their patient attributes 518 or patient EMRs indicating evidence of cardiac disease, the following conditions of the treatment rule exist:

-   Age <= 60 years = 59 (MET);
-   Patient has AML = AML (MET); and
-   Cardiac Disease = false (MET)

Since all of the criteria of the treatment rule are met by the specific information about this patient 502, then the treatment of Decitabine is a candidate treatment for consideration for this patient 502. However, if the patient had been 69 years old, the first criterion would not have been met and the Decitabine treatment would not be a candidate treatment for consideration for this patient 502. Various potential treatment recommendations may be evaluated by the healthcare cognitive system 500 based on ingested treatment guidance data 524 to identify subsets of candidate treatments for further consideration by the healthcare cognitive system 500 by scoring such candidate treatments based on evidential data obtained from the patient EMRs 522 and medical corpus and other source data 526.
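The Decitabine rule evaluation described above can be sketched as follows. This is a minimal illustration only; the names `Criterion`, `evaluate_rule`, and the patient dictionary keys are hypothetical and are not part of the described embodiments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One condition of a treatment rule (hypothetical structure)."""
    name: str
    test: Callable[[dict], bool]


def evaluate_rule(criteria: List[Criterion], patient: dict) -> Tuple[bool, List[Tuple[str, str]]]:
    """Label each criterion MET / NOT MET; the treatment is a candidate only if all are MET."""
    results = [(c.name, "MET" if c.test(patient) else "NOT MET") for c in criteria]
    is_candidate = all(status == "MET" for _, status in results)
    return is_candidate, results


# The three criteria of the example Decitabine rule from the text.
DECITABINE_RULE = [
    Criterion("Age <= 60 years", lambda p: p["age"] <= 60),
    Criterion("Patient has AML", lambda p: "AML" in p["diagnoses"]),
    Criterion("No evidence of cardiac disease", lambda p: not p["cardiac_disease"]),
]

# Assumed attribute layout; real patient attributes 518 / EMRs 522 would differ.
patient_59 = {"age": 59, "diagnoses": {"AML"}, "cardiac_disease": False}
patient_69 = {**patient_59, "age": 69}

candidate_59, detail_59 = evaluate_rule(DECITABINE_RULE, patient_59)
candidate_69, detail_69 = evaluate_rule(DECITABINE_RULE, patient_69)
```

Under these assumptions the 59 year old patient satisfies every criterion, while the 69 year old patient fails the age criterion and Decitabine is therefore not a candidate treatment.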

For example, data mining processes may be employed to mine the data in sources 522 and 526 to identify evidential data supporting and/or refuting the applicability of the candidate treatments to the particular patient 502 as characterized by the patient's patient attributes 518 and EMRs 522. For example, for each of the criteria of the treatment rule, the results of the data mining provide a set of evidence that supports giving the treatment in the cases where the criterion is “MET” and in cases where the criterion is “NOT MET.” The healthcare cognitive system 500 processes the evidence in accordance with various cognitive logic algorithms to generate a confidence score for each candidate treatment recommendation indicating a confidence that the corresponding candidate treatment recommendation is valid for the patient 502. The candidate treatment recommendations may then be ranked according to their confidence scores and presented to the user 506 as a ranked listing of treatment recommendations 528. In some cases, only a highest ranked, or final answer, is returned as the treatment recommendation 528. The treatment recommendation 528 may be presented to the user 506 in a manner that the underlying evidence evaluated by the healthcare cognitive system 500 may be accessible, such as via a drilldown interface, so that the user 506 may identify the reasons why the treatment recommendation 528 is being provided by the healthcare cognitive system 500.
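The ranking step may be illustrated with a short sketch. It assumes that confidence scores have already been produced by the cognitive logic (which the text does not specify), and the function name `rank_recommendations` and the treatment names are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_recommendations(scores: Dict[str, float], top_only: bool = False) -> List[Tuple[str, float]]:
    """Order candidate treatments by descending confidence score.

    When top_only is True, only the highest ranked (final answer) entry is returned.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:1] if top_only else ranked


# Illustrative scores only; real scores would come from evidence evaluation.
example_scores = {"Treatment A": 0.4, "Treatment B": 0.9, "Treatment C": 0.7}
ranked_listing = rank_recommendations(example_scores)
final_answer = rank_recommendations(example_scores, top_only=True)
```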

In accordance with the illustrative embodiments herein, the healthcare cognitive system 500 is augmented to include a question profile self-training engine 540. The question profile self-training engine 540 comprises logic elements 542-550 which operate in a manner similar to that described above with regard to corresponding elements 322-330 in FIG. 3. The question profile self-training engine 540 provides trained model(s) 560 to the healthcare cognitive system 500 for implementation during runtime processing of requests 508, based on supplied patient attributes 518, to provide treatment recommendations 528. The trained model(s) 560 may be implemented by the healthcare cognitive system 500 to evaluate the various data from corpora 522, 524, and 526 to generate such treatment recommendations 528. Moreover, as shown, the patient EMRs 522 corpus, or a set of one or more patient EMRs for one or more patients in the patient EMR 522 corpus, may be used as a training data set that is input to the question profile based self-training engine 540 to perform training, by the self-training model logic 544, of the model(s) 560 using an iterative self-training process in which the labeled training set (L) 542 is iteratively expanded based on the QA pairs in the unlabeled set (U) 546 and the question profiles 548 and analysis performed by the labeled training set expansion logic 550.

While FIG. 5 is depicted with an interaction between the patient 502 and a user 506, which may be a healthcare practitioner such as a physician, nurse, physician's assistant, lab technician, or any other healthcare worker, for example, the illustrative embodiments do not require such. Rather, the patient 502 may interact directly with the healthcare cognitive system 500 without having to go through an interaction with the user 506 and the user 506 may interact with the healthcare cognitive system 500 without having to interact with the patient 502. For example, in the first case, the patient 502 may be requesting 508 treatment recommendations 528 from the healthcare cognitive system 500 directly based on the symptoms 504 provided by the patient 502 to the healthcare cognitive system 500. Moreover, the healthcare cognitive system 500 may actually have logic for automatically posing questions 514 to the patient 502 and receiving responses 516 from the patient 502 to assist with data collection for generating treatment recommendations 528. In the latter case, the user 506 may operate based on only information previously gathered and present in the patient EMR 522 by sending a request 508 along with patient attributes 518 and obtaining treatment recommendations in response from the healthcare cognitive system 500. Thus, the depiction in FIG. 5 is only an example and should not be interpreted as requiring the particular interactions depicted when many modifications may be made without departing from the spirit and scope of the present invention. It should be appreciated, however, that at no time should the treatment itself be administered to the patient 502 without prior approval of the healthcare professional treating the patient, i.e., final determinations as to treatments given to a patient will always fall on the healthcare professional with the mechanisms of the illustrative embodiments serving only as an advisory tool for the healthcare professional (user 506) and/or patient 502.

As mentioned above, the healthcare cognitive system 500 may include a request processing pipeline, such as request processing pipeline 308 in FIG. 3, which may be implemented, in some illustrative embodiments, as a Question Answering (QA) pipeline. The QA pipeline may receive an input question, such as “what is the appropriate treatment for patient P?”, or a request, such as “diagnose and provide a treatment recommendation for patient P.”

Thus, the illustrative embodiments provide mechanisms for performing automatic expansion of a labeled training set of QA pairs for improving the training of machine learning models implemented by cognitive systems to perform cognitive operations. The illustrative embodiments significantly reduce the amount of human effort needed to perform such training by providing automated processes that iteratively expand from a relatively small ground truth set of training questions and answers, to obtain a large set of training questions and answers that are more representative of the variety of questions and corresponding answers that a user may ask of a cognitive system in order to obtain a particular answer. This allows the trained model to more accurately handle the various possible questions that may be asked of it when performing question answering operations or otherwise responding to requests for information. Hence, a similar result of accurate processing of questions and requests is achieved with significantly less investment of human effort and expenditure of time and resources.

FIG. 6 is a flowchart outlining an example operation for performing an iterative question profile based self-training operation in accordance with one illustrative embodiment. The operation outlined in FIG. 6 may be implemented, for example, by a question profile self-training engine implemented in one or more computing devices, such as question profile self-training engine 320 in FIG. 3 or 540 in FIG. 5, for example.

As shown in FIG. 6, the operation starts by receiving an initial labeled training set (L) comprising an initial ground truth set of QA pairs (step 610). A set of unanswered questions is also received (step 620) and an unlabeled set (U) is generated comprising, for each unanswered question, a set of QA pairs that pair the unanswered question with each possible answer in a training data set (step 630). A machine learning model is trained using the current state of the labeled training set, e.g., either the initial labeled training set L or an expanded labeled training set L′ that has been expanded based on a previous iteration of the operation outlined in FIG. 6 (step 640).
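Generation of the unlabeled set U in step 630 amounts to pairing every unanswered question with every candidate answer portion. The following sketch shows this under the assumption that questions and answer portions are plain strings; the function name `build_unlabeled_set` and the sample strings are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def build_unlabeled_set(unanswered_questions: List[str],
                        candidate_answers: List[str]) -> List[Tuple[str, str]]:
    """Pair each unanswered question with each possible answer (step 630)."""
    return list(product(unanswered_questions, candidate_answers))


# Illustrative inputs only; candidate answers might be clinical notes from a patient EMR.
questions = ["Does the patient have cardiac disease?", "What is the patient's diagnosis?"]
notes = ["Note 1: diagnosed with AML.", "Note 2: no cardiac findings.", "Note 3: follow-up visit."]
unlabeled_set = build_unlabeled_set(questions, notes)
```

The size of U grows as the product of the number of unanswered questions and the number of candidate portions, which is one reason the later selection step filters pairs by confidence.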

A determination is made as to whether a convergence state has been achieved (step 650). This convergence state may be, for example, that training of the model has not resulted in an appreciable change in any of the operational parameters of the model, e.g., a change equal to or above a threshold amount. Convergence may also be determined if all of the unanswered questions have had at least one corresponding QA pair added to the labeled training set L. In some cases, convergence may be reached after a predetermined number of iterations. Other convergence conditions may also be used without departing from the spirit and scope of the present invention.
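The three example convergence conditions can be combined in a single check, sketched below. The representation of model parameters as a flat list of floats, and the names `has_converged`, `change_threshold`, and `pending_questions`, are assumptions made for illustration.

```python
from typing import Collection, Optional, Sequence


def has_converged(prev_params: Optional[Sequence[float]],
                  new_params: Sequence[float],
                  change_threshold: float,
                  pending_questions: Collection[str],
                  iteration: int,
                  max_iterations: int) -> bool:
    """Return True if any of the example convergence conditions of step 650 holds."""
    # Condition: a predetermined number of iterations has been reached.
    if iteration >= max_iterations:
        return True
    # Condition: every unanswered question already has a QA pair in L.
    if not pending_questions:
        return True
    # No earlier parameters to compare against on the first iteration.
    if prev_params is None:
        return False
    # Condition: no parameter changed by the threshold amount or more.
    max_change = max((abs(n - p) for p, n in zip(prev_params, new_params)), default=0.0)
    return max_change < change_threshold
```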

If convergence has been achieved, then the trained model is output to the cognitive system for implementation during performance of cognitive operations (step 655). If convergence has not been achieved, then the trained model is applied to the unlabeled set U to generate confidence scores for the QA pairs (step 660). Question profiles are associated with the unanswered questions (step 670) and, for each unanswered question still represented in the unlabeled set U, dynamic selection criteria are generated based on the associated question profiles, e.g., dynamically generated confidence score thresholds and/or number of QA pairs to be selected (step 680). In some cases, as noted above, this dynamic generation of selection criteria may result in a default being utilized.
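One possible way to derive selection criteria from a question profile (step 680) is sketched below. The mapping from profile characteristics to a threshold and a pair count is an assumption made purely for illustration, as are the names `QuestionProfile`, `SelectionCriteria`, `selection_criteria_for`, and all numeric values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuestionProfile:
    """Hypothetical profile holding two of the characteristics a profile may specify."""
    answer_type: str
    multi_portion: bool  # answer likely spread across multiple document portions


@dataclass
class SelectionCriteria:
    min_confidence: float  # confidence score threshold
    max_pairs: int         # number of QA pairs that may be selected


# Illustrative default used when no profile is available for a question.
DEFAULT_CRITERIA = SelectionCriteria(min_confidence=0.9, max_pairs=1)


def selection_criteria_for(profile: Optional[QuestionProfile]) -> SelectionCriteria:
    """Generate selection criteria for one unanswered question from its profile."""
    if profile is None:
        return DEFAULT_CRITERIA
    if profile.multi_portion:
        # Assumed policy: allow several answers at a lower threshold.
        return SelectionCriteria(min_confidence=0.7, max_pairs=3)
    # Assumed policy: a single answer at a moderately strict threshold.
    return SelectionCriteria(min_confidence=0.85, max_pairs=1)
```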

QA pairs are then selected for the unanswered questions based on the selection criteria and the confidence scores associated with the QA pairs (step 690). The selected QA pairs are added to the labeled training set L to generate an expanded labeled training set L′ (which is then considered to be the new labeled training set L) (step 700) and the operation returns to step 640 where the process is repeated. This process is repeated until the convergence condition is encountered (step 650) as discussed above.
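The overall flow of FIG. 6 can be sketched as a loop. This is a simplified illustration, not the implementation of the embodiments: the training and scoring functions are passed in as parameters, only two of the convergence conditions are checked, and `toy_train` and `toy_score` are word-overlap stand-ins for a real machine learning model. All function and parameter names are hypothetical.

```python
def self_train(labeled, unanswered, candidates, train, score, criteria_for,
               max_iterations=10):
    """Iteratively expand the labeled set L and retrain (steps 610-700)."""
    labeled = list(labeled)                                   # step 610
    unlabeled = {q: list(candidates) for q in unanswered}     # steps 620-630
    iteration = 0
    while True:
        model = train(labeled)                                # step 640
        iteration += 1
        if not unlabeled or iteration >= max_iterations:      # step 650
            return model, labeled                             # step 655
        for q in list(unlabeled):
            # Step 660: confidence score for each QA pair of this question.
            scored = sorted(((score(model, q, a), a) for a in unlabeled[q]),
                            reverse=True)
            min_conf, max_pairs = criteria_for(q)             # steps 670-680
            chosen = [a for s, a in scored if s >= min_conf][:max_pairs]  # step 690
            if chosen:
                labeled.extend((q, a) for a in chosen)        # step 700
                del unlabeled[q]  # question no longer represented in U


def toy_train(labeled):
    """Stand-in model: the vocabulary of all labeled answers."""
    return {w for _, a in labeled for w in a.lower().split()}


def toy_score(model, question, answer):
    """Stand-in confidence: share of answer words known from the question or model."""
    known = set(question.lower().split()) | model
    words = answer.lower().split()
    return sum(w in known for w in words) / max(len(words), 1)
```

A usage example under these assumptions: with the ground truth pair ("what drug treats aml", "decitabine treats aml"), the unanswered question "which medication treats leukemia", and a fixed criteria function returning a threshold of 0.8 and at most one pair, the candidate "decitabine treats leukemia" is added to L and the model is retrained on the expanded set.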

As noted above, it should be appreciated that the illustrative embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In one example embodiment, the mechanisms of the illustrative embodiments are implemented in software or program code, which includes but is not limited to firmware, resident software, microcode, etc.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a communication bus, such as a system bus, for example. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. The memory may be of various types including, but not limited to, ROM, PROM, EPROM, EEPROM, DRAM, SRAM, Flash memory, solid state memory, and the like.

Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening wired or wireless I/O interfaces and/or controllers, or the like. I/O devices may take many different forms other than conventional keyboards, displays, pointing devices, and the like, such as for example communication devices coupled through wired or wireless connections including, but not limited to, smart phones, tablet computers, touch screen devices, voice recognition devices, and the like. Any known or later developed I/O device is intended to be within the scope of the illustrative embodiments.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters for wired communications. Wireless communication based network adapters may also be utilized including, but not limited to, 802.11 a/b/g/n wireless communication adapters, Bluetooth wireless adapters, and the like. Any known or later developed network adapters are intended to be within the spirit and scope of the present invention.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
 1. A method in a data processing system comprising a processor and a memory, the memory comprising instructions which cause the processor to be configured to implement a self-training engine of a question and answer system, the method comprising: pairing, by the self-training engine, an unanswered natural language question with portions of an electronic document to generate an unlabeled data set; training, by the self-training engine, a model based on a labeled data set comprising a finite number of question and answer pair data structures, to thereby generate a trained model; applying, by the self-training engine, the trained model to the unlabeled data set to identify an answer from the portions of the electronic document to the unanswered natural language question; modifying, by the self-training engine, the labeled data set to include the identified answer and corresponding unanswered natural language question as an additional question and answer pair data structure; training, by the self-training engine, the trained model based on the modified labeled data set, wherein the unanswered natural language question is a question different from questions in the finite number of questions and answer pair data structures of the labeled data set; generating, for each portion of the electronic document paired with the unanswered natural language question in the unlabeled data set, a confidence value based on application of the trained model to the unlabeled data set; retrieving, for the unanswered natural language question in the unlabeled data set, a question profile specifying first characteristics of a question in the labeled data set, and second characteristics of a correct answer to the question in the labeled data set, that corresponds to the unanswered natural language question based on a similarity evaluation of the unanswered natural language question with the questions in the question and answer pair data structures of the labeled data set; and dynamically generating a threshold confidence value for the unanswered natural language question based on the first and second characteristics in the question profile.
 2. The method of claim 1, wherein the electronic document is a patient electronic medical record (EMR), and wherein the portions are clinical notes in the patient EMR.
 3. The method of claim 1, wherein retrieving the question profile comprises: calculating, for each question in the labeled data set, a similarity score measuring a similarity of first features of the question to second features of the unanswered natural language question; selecting a question from the labeled data set having a highest similarity score; and retrieving a profile associated with the selected question as the question profile.
 4. The method of claim 1, further comprising: comparing, for each of the portions of the electronic document paired with the unanswered natural language question, a confidence value of the portion to the dynamically generated threshold confidence value; and identifying the answer for the unanswered natural language question from a portion of the electronic document whose corresponding confidence value meets a requirement specified by the threshold confidence value.
 5. The method of claim 4, wherein identifying the answer for the unanswered natural language question from a portion of the electronic document whose corresponding confidence value meets a requirement specified by the threshold confidence value comprises selecting a plurality of answers from a plurality of portions of the electronic document, wherein a number of answers in the plurality of answers is determined based on the characteristics in the question profile.
 6. The method of claim 1, wherein the first characteristics of the unanswered natural language question comprise at least one of key entities associated with the unanswered natural language question, relations between key entities associated with the unanswered natural language question, an answer type, or an indicator of whether the answer is likely to be found in a single portion or across multiple portions of an electronic document.
 7. The method of claim 1, wherein the question profile further comprises an indicator of all other questions in the unlabeled data set and labeled data set that are similar to the unanswered natural language question, wherein similar questions are questions whose answers are the same answer.
 8. The method of claim 1, further comprising: clustering questions in the unlabeled data set based on comparisons of features of the questions in the unlabeled data set with at least one of the first and second characteristics of the question profile to generate a question cluster comprising similar questions, wherein similar questions are questions whose answers are the same answer; and propagating a label associated with a question answer pair of a first question in the question cluster to all other questions in the question cluster.
 9. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: pair an unanswered natural language question with portions of an electronic document to generate an unlabeled data set; train a model based on a labeled data set comprising a finite number of question and answer pair data structures, to thereby generate a trained model; apply the trained model to the unlabeled data set to identify an answer from the portions of the electronic document to the unanswered natural language question; modify the labeled data set to include the identified answer and corresponding unanswered natural language question as an additional question and answer pair data structure; train the trained model based on the modified labeled data set, wherein the unanswered natural language question is a question different from questions in the finite number of questions and answer pair data structures of the labeled data set; generate, for each portion of the electronic document paired with the unanswered natural language question in the unlabeled data set, a confidence value based on application of the trained model to the unlabeled data set; retrieve, for the unanswered natural language question in the unlabeled data set, a question profile specifying first characteristics of a question in the labeled data set, and second characteristics of a correct answer to the question in the labeled data set, that corresponds to the unanswered natural language question based on a similarity evaluation of the unanswered natural language question with the questions in the question and answer pair data structures of the labeled data set; and dynamically generate a threshold confidence value for the unanswered natural language question based on the first and second characteristics in the question profile.
 10. The computer program product of claim 9, wherein the electronic document is a patient electronic medical record (EMR), and wherein the portions are clinical notes in the patient EMR.
 11. The computer program product of claim 9, wherein the computer readable program further causes the computing device to retrieve the question profile at least by: calculating, for each question in the labeled data set, a similarity score measuring a similarity of first features of the question to second features of the unanswered natural language question; selecting a question from the labeled data set having a highest similarity score; and retrieving a profile associated with the selected question as the question profile.
 12. The computer program product of claim 9, wherein the computer readable program further causes the computing device to: compare, for each of the portions of the electronic document paired with the unanswered natural language question, a confidence value of the portion to the dynamically generated threshold confidence value; and identify the answer for the unanswered natural language question from a portion of the electronic document whose corresponding confidence value meets a requirement specified by the threshold confidence value.
 13. The computer program product of claim 11, wherein the computer readable program further causes the computing device to identify the answer for the unanswered natural language question from a portion of the electronic document whose corresponding confidence value meets a requirement specified by the threshold confidence value at least by selecting a plurality of answers from a plurality of portions of the electronic document, wherein a number of answers in the plurality of answers is determined based on the characteristics in the question profile.
 14. The computer program product of claim 9, wherein first characteristics of the unanswered natural language question comprise at least one of key entities associated with the unanswered natural language question, relations between key entities associated with the unanswered natural language question, an answer type, or an indicator of whether the answer is likely to be found in a single portion or across multiple portions of an electronic document.
 15. The computer program product of claim 9, wherein the question profile further comprises an indicator of all other questions in the unlabeled data set and labeled data set that are similar to the unanswered natural language question, wherein similar questions are questions whose answers are the same answer.
 16. An apparatus comprising: a processor; and a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to: pair an unanswered natural language question with portions of an electronic document to generate an unlabeled data set; train a model based on a labeled data set comprising a finite number of question and answer pair data structures, to thereby generate a trained model; apply the trained model to the unlabeled data set to identify an answer from the portions of the electronic document to the unanswered natural language question; modify the labeled data set to include the identified answer and corresponding unanswered natural language question as an additional question and answer pair data structure; train the trained model based on the modified labeled data set, wherein the unanswered natural language question is a question different from questions in the finite number of questions and answer pair data structures of the labeled data set; generate, for each portion of the electronic document paired with the unanswered natural language question in the unlabeled data set, a confidence value based on application of the trained model to the unlabeled data set; retrieve, for the unanswered natural language question in the unlabeled data set, a question profile specifying first characteristics of a question in the labeled data set, and second characteristics of a correct answer to the question in the labeled data set, that corresponds to the unanswered natural language question based on a similarity evaluation of the unanswered natural language question with the questions in the question and answer pair data structures of the labeled data set; and dynamically generate a threshold confidence value for the unanswered natural language question based on the first and second characteristics in the question profile.